Abstract

The CFG recognition problem is: given a context-free grammar $\mathcal{G}$ and a string $w$ of length
$n$, decide if $w$ can be obtained from $\mathcal{G}$. This is the most basic parsing question and is a core
computer science problem. Valiant’s parser from 1975 solves the problem in $O(n^{\omega})$ time, where
$\omega < 2.373$ is the matrix multiplication exponent. Dozens of parsing algorithms have been
proposed over the years, yet Valiant’s upper bound remains unbeaten. The best combinatorial
algorithms have mildly subcubic $O(n^3/\log^3 n)$ complexity.

Lee (JACM’01) provided evidence that fast matrix multiplication is needed for CFG parsing,
and that very efficient and practical algorithms might be hard or even impossible to obtain.
Lee showed that any algorithm for a more general parsing problem with running time
$O(|\mathcal{G}| \cdot n^{3-\varepsilon})$ can be converted into a surprising subcubic algorithm for Boolean Matrix
Multiplication. Unfortunately, Lee’s hardness result required that the grammar size be
$|\mathcal{G}| = \Omega(n^6)$. Nothing was known for the more relevant case of constant size grammars.

In this work, we prove that any improvement on Valiant’s algorithm, even for constant size
grammars, either in terms of runtime or by avoiding the inefficiencies of fast matrix
multiplication, would imply a breakthrough algorithm for the $k$-Clique problem: given a graph on $n$
nodes, decide if there are $k$ that form a clique.

Besides classifying the complexity of a fundamental problem, our reduction has led us to
similar lower bounds for more modern and well-studied cubic time problems for which faster
algorithms are highly desirable in practice: RNA Folding, a central problem in computational
biology, and Dyck Language Edit Distance, answering an open question of Saha (FOCS’14).


1 Introduction 


Context-free grammars (CFG) and languages (CFL), introduced by Chomsky in 1956 [Cho59], play
a fundamental role in computability theory [Sip96], formal language theory [HMU06], programming
languages [ASU86], natural language processing [JM00], and computer science in general, with
applications in diverse areas such as computational biology [DEKM98] and databases [KSSY13].
They are essentially a sweet spot between very expressive languages (like natural languages) that
computers cannot parse well, and the more restrictive languages (like regular languages) that even
a DFA can parse.

In this paper, we will be concerned with the following very basic definitions. A CFG $\mathcal{G}$ in
Chomsky Normal Form over a set of terminals (i.e. alphabet) $\Sigma$ consists of a set of nonterminals
$T$, including a specified starting symbol $S \in T$, and a set of productions (or derivation rules) of
the form $A \to B\,C$ or $A \to \sigma$ for some $A, B, C \in T$ and $\sigma \in \Sigma$. Each CFG $\mathcal{G}$ defines a CFL
$\mathcal{L}(\mathcal{G})$ of strings in $\Sigma^*$ that can be obtained by starting with $S$ and recursively applying arbitrary
derivation rules from the grammar. The CFG recognition problem is: given a CFG $\mathcal{G}$ and a string
$w \in \Sigma^*$, determine if $w$ can be obtained from $\mathcal{G}$ (i.e. whether $w \in \mathcal{L}(\mathcal{G})$). The problem is of most
fundamental and practical interest when we restrict $\mathcal{G}$ to be of fixed size and let the length of the
string $n = |w|$ be arbitrary.

The main question we will address in this work is: what is the time complexity of the CFG 
recognition problem? 

Besides the clear theoretical importance of this question, the practical motivation is
overwhelming. CFG recognition is closely related to the parsing problem in which we also want to output a
possible derivation sequence of the string from the grammar (if $w \in \mathcal{L}(\mathcal{G})$). Parsing is essential:
this is how computers understand our programs, scripts, and algorithms. Any algorithm for parsing
solves the recognition problem as well, and Ruzzo [Ruz79] showed that CFG recognition is at least
as hard as parsing, at least up to logarithmic factors, making the two problems roughly equivalent.

Not surprisingly, the critical nature of CFG recognition has led to the development of a long
list of clever algorithms for it, including classical works [Val75, Ear70, CS70, You67, Kas65, Knu65,
DeR69, Bak79, Lan74], and the search for practical parsing algorithms that work well for varied
applications is far from over [PK09, RSCJ10, SBMN13, CSC13]. For example, the canonical CYK
algorithm from the 1960’s [CS70, Kas65, You67] constructs a dynamic programming table $D$ of size
$n \times n$ such that cell $D(i,j)$ contains the list of all nonterminals that can produce the substring of $w$
from position $i$ to position $j$. The table can be computed with linear time per entry, by enumerating
all derivation rules $A \to B\,C$ and checking whether for some $i \le k < j$, $D(i,k)$ contains $B$ and
$D(k+1,j)$ contains $C$ (and if so, adding $A$ to $D(i,j)$). This gives an upper bound of $O(n^3)$ for the
problem. Another famous algorithm is Earley’s from 1970 [Ear70], which proceeds by a top-down
dynamic programming approach and could perform much faster when the grammar has certain
properties. Variants of Earley’s algorithm were shown to run in mildly subcubic^1 $O(n^3/\log^2 n)$ time
[GRH80, Ryt85].
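The CYK table computation described above can be sketched in a few lines. This is a minimal illustration, not code from the cited works: the function name `cyk_recognize`, the rule encoding (pairs for $A \to \sigma$, triples for $A \to B\,C$) and the toy grammar in the usage note are assumptions made here.

```python
def cyk_recognize(word, start, unary_rules, binary_rules):
    """Return True iff `word` is derivable from `start` in a CNF grammar.

    unary_rules:  iterable of (A, a) pairs for productions A -> a
    binary_rules: iterable of (A, B, C) triples for productions A -> B C
    (this encoding is an assumption of the sketch)
    """
    n = len(word)
    if n == 0:
        return False  # the CNF used here has no epsilon productions
    # table[i][j] = set of nonterminals deriving word[i..j] (inclusive)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][i] = {A for (A, a) in unary_rules if a == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):  # split point with i <= k < j
                for (A, B, C) in binary_rules:
                    if B in table[i][k] and C in table[k + 1][j]:
                        table[i][j].add(A)
    return start in table[0][n - 1]
```

For instance, with the (hypothetical) CNF grammar for non-empty balanced parentheses, `unary = {("L", "("), ("R", ")")}` and `binary = {("S", "L", "R"), ("S", "L", "X"), ("S", "S", "S"), ("X", "S", "R")}`, the call `cyk_recognize("(())()", "S", unary, binary)` accepts. The three nested loops over `length`, `i` and `k` are where the cubic dependence on $n$ comes from.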

In 1975 a big theoretical breakthrough was achieved by Valiant [Val75], who designed a
sophisticated recursive algorithm that is able to utilize many fast Boolean matrix multiplications to speed
up the computation of the dynamic programming table from the CYK algorithm. The time
complexity of the CFG problem decreased to $O(g^2 n^{\omega})$, where $\omega < 2.373$ is the matrix multiplication
exponent [Wil12, Gal14] and $g$ is the size of the grammar; because in most applications $g = O(1)$,
Valiant’s runtime is often cited as $O(n^{\omega})$. In 1995 Rytter [Ryt95] described this algorithm as
“probably the most interesting algorithm related to formal languages” and it is hard to argue with
this quote even today, 40 years after Valiant’s result. Follow-up works proposed simplifications
of the algorithm [Ryt95], generalized it to stochastic CFG parsing [BS07], and applied it to other
problems [Aku98, ZTZU10].

^1 As is typical, we distinguish between “truly subcubic” runtimes, $O(n^{3-\varepsilon})$ for constant
$\varepsilon > 0$, and “mildly subcubic” for all other subcubic runtimes.

Despite its vast academic impact, Valiant’s algorithm has enjoyed little success in practice.
The theoretically fastest matrix multiplication algorithms are not currently practical, and Valiant’s
algorithm can often be outperformed by “combinatorial” methods in practice, even if the most
practical truly subcubic fast matrix multiplication algorithm (Strassen’s [Str69]) is used.
Theoretically, the fastest combinatorial algorithms for Boolean Matrix Multiplication (BMM) run in time
$O(n^3/\log^4 n)$ [Cha15, Yu15]. To date, no combinatorial algorithm for BMM or CFG recognition
with truly subcubic running time is known.

In the absence of efficient algorithms and the lack of techniques for proving superlinear
unconditional lower bounds for any natural problem, researchers have turned to conditional lower bounds
for CFG recognition and parsing. In the 1970’s, Harrison and Havel [HH74] observed that
any algorithm for the problem would imply an algorithm that can verify a Boolean matrix
multiplication of two $\sqrt{n} \times \sqrt{n}$ matrices. This reduction shows that a combinatorial $O(n^{1.5-\varepsilon})$ recognition
algorithm would imply a breakthrough subcubic algorithm for BMM. Ruzzo [Ruz79] showed that
a parsing algorithm that says whether each prefix of the input string is in the language could
even compute the BMM of two $\sqrt{n} \times \sqrt{n}$ matrices. Even when only considering combinatorial
algorithms, these $\Omega(n^{1.5})$ lower bounds left a large gap compared to the cubic upper bound. A
big step towards tight lower bounds was in the work of Satta [Sat94] on parsing Tree Adjoining
Grammars, which was later adapted by Lee [Lee02] to prove her famous conditional lower bound
for CFG parsing. Lee proved that BMM of two $n \times n$ matrices can be reduced to “parsing” a
string of length $O(n^{1/3})$ with respect to a CFG of size $\Theta(n^2)$, where the parser is required to say
for each nonterminal $T$ and substring $w[i:j]$ whether $T$ can derive $w[i:j]$ in a valid derivation of
$w$ from the grammar. This reduction proves that such parsers cannot be combinatorial and run in
$O(g n^{3-\varepsilon})$ time without implying a breakthrough in BMM algorithms.

Lee’s result is important, but it suffers from significant limitations which have been pointed
out by many researchers (e.g. [Ruz79, Lee02, Sah14a, Sah14b]). We describe a few of these below.
Despite the limitations, however, the only progress after Lee’s result is a recent observation by Saha
[Sah15] that one can replace BMM in Lee’s proof with APSP by augmenting the production rules
with probabilities, thus showing an APSP-based lower bound for Stochastic CFG parsing. Because
Saha uses Lee’s construction, her lower bound suffers from exactly the same limitations.

The first (and most major) limitation of Lee’s lower bound is that it is irrelevant unless the size
of the grammar is much larger than the string; in particular, it is cubic only when $g = \Omega(n^6)$. A
CFG whose description needs to grow with the input string does not really define a CFL, and as
Lee points out, this case can be unrealistic in many applications. In programming languages, for
instance, the grammar size is much smaller than the programs one is interested in, and in fact the
grammar can be hardcoded into the parser. A parsing algorithm that runs in time $O(g^3 n)$, which is
not ruled out by Lee’s result, could be much more appealing than one that runs in $O(g n^{2.5})$ time.

The second limitation of both Lee’s and Ruzzo’s lower bounds is the quite demanding
requirement from the parser to provide extra information besides returning some parse tree. These lower
bounds do not hold for recognizers nor any parser with minimal but meaningful output.
Theoretically, it is arguably more fundamental to ask: what is the time complexity of CFG
recognition and parsing that can be obtained by any algorithm, not necessarily combinatorial?
Lee’s result cannot give a meaningful answer to this question. To get a new upper bound for BMM
via Lee’s reduction, one needs a parser that runs in near-linear time. Lee’s result does not rule
out, say, an $O(n^{1.11})$ time parser that uses fast matrix multiplication; such a parser would be an
amazing result.

Our first observation is that to answer these questions and understand the complexity of CFG
recognition, we may need to find a problem other than BMM to reduce from. Despite the apparent
similarity in complexities of both problems (we expect both to be cubic for combinatorial algorithms
and $O(n^{\omega})$ for unrestricted ones), there is a big gap in complexities because of the input size. When
the grammar is fixed, a reduction cannot encode any information in the grammar and can only use
the $n$ letters of the string, i.e. $O(n)$ bits of information, while an instance of BMM requires $O(n^2)$
bits to specify. Thus, at least with respect to reductions that produce a single instance of CFG
recognition, we do not expect BMM to imply a higher than $\Omega(n^{1.5})$ lower bound for parsing a fixed
size grammar.
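The counting behind this estimate can be written out explicitly. The following is only a sketch: the symbol $m$ for the dimension of the BMM instance is introduced here for illustration, and the calculation assumes a single produced string of length $n$ over a constant-size alphabet.

```latex
\begin{align*}
\underbrace{2m^2}_{\text{bits of an } m \times m \text{ BMM instance}}
  \;\le\; \underbrace{O(n)}_{\text{bits in the string}}
  &\;\Longrightarrow\; m = O\big(\sqrt{n}\big), \\
\text{largest lower bound transferable from cubic BMM}
  &\;\le\; m^{3} \;=\; O\big(n^{3/2}\big).
\end{align*}
```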

Main Result. In this paper we present a tight reduction from the $k$-Clique problem to the
recognition of a fixed CFG and prove a new lower bound for CFG recognition that overcomes all the
above limitations of the previously known lower bounds. Unless a breakthrough $k$-Clique algorithm
exists, our lower bound completely matches Valiant’s 40-year-old upper bound for unrestricted
algorithms and completely matches CYK and Earley’s for combinatorial algorithms, thus resolving
the complexity of CFG recognition even on fixed size grammars.

Before formally stating our results, let us give some background on $k$-Clique. This fundamental
graph problem asks whether a given undirected unweighted graph on $n$ nodes and $O(n^2)$ edges
contains a clique on $k$ nodes. This is the parameterized version of the famously NP-hard Max-Clique
(or equivalently, Max-Independent-Set) [Kar72]. $k$-Clique is amongst the most well-studied
problems in theoretical computer science, and it is the canonical intractable (W[1]-complete) problem
in parameterized complexity.

A naive algorithm solves $k$-Clique in $O(n^k)$ time. By a reduction from 1985 to BMM on matrices
of size $n^{k/3} \times n^{k/3}$ it can be solved with fast matrix multiplication in $O(n^{\omega k/3})$ time [NP85] whenever
$k$ is divisible by 3 (otherwise, more ideas are needed [EG04]). No better algorithms are known, and
researchers have wondered if improvements are possible [Woe04, Aut84]. As is the case for BMM,
obtaining faster than trivial combinatorial algorithms, by more than polylogarithmic factors, for
$k$-Clique is a longstanding open question. The fastest combinatorial algorithm runs in $O(n^k/\log^k n)$
time [Vas09].
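The idea behind the reduction to matrix multiplication can be illustrated with a small sketch: list all $k$-cliques, connect two of them when together they form a $2k$-clique, and look for a triangle in this auxiliary graph. The code below is an illustration under assumptions made here (nodes numbered `0..n-1`, no self-loops, the names `k_cliques` and `has_3k_clique`), and it finds the triangle with plain loops, which is the step where a fast matrix product would be used instead.

```python
from itertools import combinations

def k_cliques(n, adj, k):
    """All k-subsets of range(n) whose nodes are pairwise adjacent."""
    return [c for c in combinations(range(n), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]

def has_3k_clique(n, edges, k):
    """Decide 3k-Clique via triangle detection on the graph of k-cliques."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = k_cliques(n, adj, k)
    m = len(cliques)
    # A[i][j] = 1 iff cliques i and j together form a 2k-clique.
    # A shared node fails the test because there are no self-loops.
    A = [[0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        if all(v in adj[u] for u in cliques[i] for v in cliques[j]):
            A[i][j] = A[j][i] = 1
    # Triangle detection: an entry where both A and the Boolean square of A are set.
    for i in range(m):
        for j in range(m):
            if A[i][j] and any(A[i][l] and A[l][j] for l in range(m)):
                return True
    return False
```

The auxiliary matrix has one row per $k$-clique, so at most $n^k$ rows, which matches the $n^{k/3} \times n^{k/3}$ matrix size quoted above once $k$ there is replaced by $3k$.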

Let $0 \le F \le \omega$ and $0 \le C \le 3$ be the smallest numbers such that $3k$-Clique can be solved
combinatorially in $O(n^{Ck})$ time and in $O(n^{Fk})$ time by any algorithm, for any (large enough)
constant $k \ge 1$. A conjecture in graph algorithms and parameterized complexity is that $C = 3$
and $F = \omega$. It is known that an algorithm refuting this conjecture immediately implies a faster
exact algorithm for MAX-CUT [Wil05, Woe08]. Note that even a linear time algorithm for BMM
($\omega = 2$) would not prove that $F < 2$. A well known result by Chen et al. [CCF+05, CHKX06]
shows that $F > 0$ under the Exponential Time Hypothesis. A plausible conjecture about the
parameterized complexity of Subset-Sum implies that $F \ge 1.5$ [ALW14]. There are many other
negative results that intuitively support this conjecture: Vassilevska W. and Williams proved that
a truly subcubic combinatorial algorithm for 3-Clique implies such an algorithm for BMM as well
[WW10]. Unconditional lower bounds for $k$-Clique are known for various computational models,
such as $\Omega(n^k)$ for monotone circuits [AB87]. The planted Clique problem has also proven to be
very challenging (e.g. [AAK+07, AKS98, HK11, Jer92]). Max-Clique is also known to be hard to
efficiently approximate within nontrivial factors [Has99].

Formally, our reduction from $k$-Clique to CFG recognition proves the following theorem.


Theorem 1. There is a context-free grammar $\mathcal{G}_C$ of constant size such that if we can determine if
a string of length $n$ can be obtained from $\mathcal{G}_C$ in $T(n)$ time, then $k$-Clique on $n$ node graphs can be
solved in $O(T(n^{k/3+1}))$ time, for any $k \ge 3$. Moreover, the reduction is combinatorial.

To see the tightness of our reduction, let $1 \le f \le \omega$ and $1 \le c \le 3$ denote the smallest numbers
such that CFG recognition can be solved in $O(n^f)$ time and combinatorially in $O(n^c)$ time. An
immediate corollary of Theorem 1 is that $f \ge F$ and $c \ge C$. Under the plausible assumption
that current $k$-Clique algorithms are optimal, up to $n^{o(1)}$ improvements, our theorem implies that
$f \ge \omega$ and $c \ge 3$. Combined with Valiant’s algorithm we get that $f = \omega$, and with standard CFG
parsers we get that $c = 3$. Because our grammar size $g$ is fixed, we also rule out $O(h(g) \cdot n^{3-\varepsilon})$ time
combinatorial CFG parsers for any computable function $h(g)$.
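To make the corollary concrete, the following short calculation shows why a truly subcubic recognizer would beat $n^{3k}$ for $3k$-Clique. It is a sketch: the constant $\varepsilon$ and the threshold on $k$ are introduced here for illustration, and Theorem 1 is applied with $3k$ in place of $k$.

```latex
\begin{align*}
T(N) = O\big(N^{3-\varepsilon}\big)
  &\;\Longrightarrow\; 3k\text{-Clique in } O\big(T(n^{3k/3+1})\big)
     = O\big(n^{(3-\varepsilon)(k+1)}\big), \\
(3-\varepsilon)(k+1) &\;=\; 3k - \varepsilon k + 3 - \varepsilon
  \;\le\; 3k - \varepsilon \qquad \text{whenever } k \ge 3/\varepsilon .
\end{align*}
```

The same calculation with $\omega - \varepsilon$ in place of $3 - \varepsilon$ gives the corresponding statement for unrestricted algorithms.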

In other words, we construct a single fixed context-free grammar $\mathcal{G}_C$ for which the recognition
problem (and therefore any parsing problem) cannot be solved any faster than by Valiant’s
algorithm and any combinatorial recognizer will not run in truly subcubic time, without implying a
breakthrough algorithm for the Clique problem. This (conditionally) proves that these algorithms
are optimal general purpose CFG parsers, and more efficient parsers will only work for CFLs with
special restricting properties. On the positive side, our reduction might hint at what a CFG should
look like to allow for efficient parsing.

The online version of CFG recognition is as follows: preprocess a CFG such that given a string
$w$ that is revealed one letter at a time, so that at stage $i$ we get $w[1 \cdots i]$, we can say as quickly as
possible whether $w[1 \cdots i]$ can be derived from the grammar (before seeing the next letters). One
usually tries to minimize the total time it takes to provide all the $|w| = n$ answers. This problem
has a long history of algorithms [Wei76, GRH80, Ryt85] and lower bounds [HS65, Gal69, Sei86].
The current best upper bound is $O(n^3/\log^2 n)$ total running time, and the best lower bound is
$\Omega(n^2/\log n)$. Since this is a harder problem, our lower bound for CFG recognition also holds for it.


1.1 More Results 

The main ingredient in the proof of Theorem 1 is a lossless encoding of a graph into a string that
belongs to a simple CFL iff the graph contains a $k$-Clique. Besides classifying the complexity of a
fundamental problem, this construction has led us to new lower bounds for two more modern and
well-studied cubic time problems for which faster algorithms are highly desirable in practice.


RNA Folding. The RNA folding problem is a version of maximum matching and is one of the
most relevant problems in computational biology. Its most basic version can be neatly defined as
follows. Let $\Sigma$ be a set of letters and let $\Sigma' = \{\sigma' \mid \sigma \in \Sigma\}$ be the set of “matching” letters, such
that for every letter $\sigma \in \Sigma$ the pair $\sigma, \sigma'$ match. Given a sequence of $n$ letters over $\Sigma \cup \Sigma'$, the RNA
folding problem asks for the maximum number of non-crossing pairs $\{i,j\}$ such that the $i$-th and
$j$-th letter in the sequence match.
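A straightforward cubic-time dynamic program for this basic version can be sketched as follows. It is an illustration only: letters are represented as strings, with the matching letter of `"a"` written `"a'"`, and the names `matches` and `rna_folding` are choices made for this sketch.

```python
def matches(x, y):
    """Letters match iff one is the primed version of the other, e.g. "a" and "a'"."""
    return x == y + "'" or y == x + "'"

def rna_folding(seq):
    """Maximum number of non-crossing matching pairs in `seq` (cubic-time DP)."""
    n = len(seq)
    # best[i][j] = optimum for the half-open slice seq[i:j]
    best = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            score = best[i + 1][j]              # seq[i] stays unpaired
            for k in range(i + 1, j):           # seq[i] is paired with seq[k]
                if matches(seq[i], seq[k]):
                    # pairs strictly inside (i, k) and pairs to the right of k
                    score = max(score, 1 + best[i + 1][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n]
```

For example, `rna_folding(["a", "b", "b'", "a'"])` pairs both letters in a nested fashion, while in `["a", "b", "a'", "b'"]` the two candidate pairs cross, so only one of them can be kept.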

The problem can be viewed as a Stochastic CFG parsing problem in which the CFG is very
restricted. This intuition has led to many mildly subcubic algorithms for the problem [Son15, Aku98,
BTZZU11, PTZZU11, VGF13, FG09, ZTZU10]. The main idea is to adapt Valiant’s algorithm by
replacing his BMM with a (min, +)-matrix product computation (i.e. distance or tropical
product). Using the fastest known algorithm for (min, +)-matrix multiplication it is possible to solve
RNA folding in $n^3/2^{\Omega(\sqrt{\log n})}$ time [Wil14]. The fastest combinatorial algorithms, however, run in
roughly $O(n^3/\log^2 n)$ time [Son15, BTZZU11], and we show that a truly subcubic such algorithm
would imply a breakthrough Clique algorithm. Our result gives a negative partial answer to an
open question raised by Andoni [And14].

Theorem 2. If RNA Folding on a sequence of length $n$ can be solved in $T(n)$ time, then $k$-Clique
on $n$ node graphs can be solved in $O(T(n^{k/3+O(1)}))$ time, for any $k \ge 3$. Moreover, the reduction
is combinatorial.

Besides a tight lower bound for combinatorial algorithms, our result also shows that a faster than
$O(n^{\omega})$ algorithm for RNA Folding is unlikely. Such an upper bound is not known for the problem,
leaving a small gap in the complexity of the problem after this work. However, we observe that
the known reductions from RNA Folding to (min, +)-matrix multiplication produce matrices with
very special “bounded monotone” structure: the entries are bounded by $n$ and every row and
column are monotonically increasing. This exact structure has allowed Chan and Lewenstein to
solve the “bounded monotone” (min, +)-convolution problem in truly subquadratic time [CL15].
Their algorithm uses interesting tools from additive combinatorics and it seems very plausible that
the approach will lead to a truly subcubic (non-combinatorial) algorithm for “bounded monotone”
(min, +) product and therefore for RNA Folding.

Dyck Edit Distance. The lower bound in Theorem 1 immediately implies a similar lower bound
for the Language Edit Distance problem on CFLs, in which we want to be able to return the
minimal edit distance between a given string $w$ and a string in the language. That is, zero if $w$ is
in the language, and otherwise the length of the shortest sequence of insertions, deletions, and
substitutions that is needed to convert $w$ to a string that is in the language. This is a classical problem
introduced by Aho and Peterson in the early 70’s with cubic time algorithms [AP72, Mye95] and
many diverse applications [KSSY13, GCS+00, FM80]. Very recently, Rajasekaran and Nicolae
[RN14] and Saha [Sah14b] obtained truly subcubic time approximation algorithms for the problem
for arbitrary CFGs.

However, in many applications the CFG we are working with is very restricted and therefore
easy to parse in linear time. One of the simplest CFGs with big practical importance is the Dyck
grammar, which produces all strings of well-balanced parentheses. The Dyck recognition problem
can be easily solved in linear time with a single pass on the input. Despite the grammar’s very
special structure, the Dyck Edit Distance problem is not known to have a subcubic algorithm.
In a recent breakthrough, Saha [Sah14a] presented a near-linear time algorithm that achieves a
logarithmic approximation for the problem. Dyck edit distance can be viewed as a generalization of
the classical string edit distance problem whose complexity is essentially quadratic [CLRS09, BI15],
and Saha’s approximation algorithm nearly matches the best known approximation algorithms for
string edit distance of Andoni, Krauthgamer, and Onak [AKO10], both in terms of running time
and approximation factor. This naturally leads one to wonder whether the complexity of (exact)
Dyck edit distance might also be quadratic. We prove that this is unlikely unless $\omega = 2$ or there
are faster clique finding algorithms.
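The single-pass linear-time recognizer for the Dyck language mentioned above can be sketched with a stack. The bracket alphabet `BRACKETS` and the name `is_dyck` are assumptions of this illustration; only exact recognition is shown, not edit distance.

```python
# Hypothetical bracket alphabet for this sketch: opener -> matching closer.
BRACKETS = {"(": ")", "[": "]"}

def is_dyck(s, pairs=BRACKETS):
    """Return True iff `s` is a well-balanced string over the given brackets."""
    closers = set(pairs.values())
    stack = []  # closers we still expect, innermost last
    for ch in s:
        if ch in pairs:
            stack.append(pairs[ch])
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return False
        else:
            return False  # symbol outside the alphabet
    return not stack
```

Each symbol is pushed or popped at most once, so the running time is linear in the length of the input.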
Theorem 3. If Dyck edit distance on a sequence of length $n$ can be solved in $T(n)$ time, then
$3k$-Clique on $n$ node graphs can be solved in $O(T(n^{k+O(1)}))$ time, for any $k \ge 1$. Moreover, the
reduction is combinatorial.

Our result gives an answer to an open question of Saha [Sah14a], who asked if Lee’s lower
bound holds for the Dyck Edit Distance problem, and shows that the search for good approximation
algorithms for the problem is justified since efficient exact algorithms are unlikely.

Remark. A simple observation shows that the longest common subsequence (LCS) problem on
two sequences $x, y$ of length $n$ over an alphabet $\Sigma$ can be reduced to RNA folding on a sequence of
length $2n$: if $y = y_1 \cdots y_n$ then let $\hat{y} := y_n' \cdots y_1'$ and then $\mathrm{RNA}(x \circ \hat{y}) = \mathrm{LCS}(x, y)$. A quadratic
lower bound for LCS was recently shown under the Strong Exponential Time Hypothesis (SETH)
[ABV15, BK15], which implies such a lower bound for RNA folding as well (and, with other ideas
from this work, for CFG recognition and Dyck Edit Distance). However, we are interested in higher
lower bounds, ones that match Valiant’s algorithm, and basing such lower bounds on SETH would
imply that faster matrix multiplication algorithms refute SETH, a highly unexpected breakthrough.
Instead, we base our hardness on $k$-Clique and devise more delicate constructions that use the
cubic-time nature of our problems.
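The reduction in the remark can be checked on small inputs. The sketch below reverses $y$, primes each letter, concatenates the result to $x$, and compares a brute-force folding score with a textbook LCS table. The function names and the convention of writing the matching letter of `"a"` as `"a'"` are assumptions of this illustration.

```python
from functools import lru_cache

def rna_folding(seq):
    """Max number of non-crossing matching pairs; "a" matches "a'" (memoized recursion)."""
    seq = tuple(seq)

    def matches(x, y):
        return x == y + "'" or y == x + "'"

    @lru_cache(maxsize=None)
    def best(i, j):  # optimum for the slice seq[i:j]
        if j - i < 2:
            return 0
        score = best(i + 1, j)
        for k in range(i + 1, j):
            if matches(seq[i], seq[k]):
                score = max(score, 1 + best(i + 1, k) + best(k + 1, j))
        return score

    return best(0, len(seq))

def lcs(x, y):
    """Classical quadratic-time LCS length."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            if a == b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

def lcs_via_rna_folding(x, y):
    """LCS(x, y) computed as the folding score of x followed by reversed, primed y."""
    y_hat = [c + "'" for c in reversed(y)]
    return rna_folding(list(x) + y_hat)
```

Since letters of $x$ can only match primed letters, every pair joins a position of $x$ with a position of $\hat{y}$, and the non-crossing condition forces these pairs to be nested, which is exactly a common subsequence of $x$ and $y$.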

Proof Outlines. In our three proofs, the main approach is the following. We will first preprocess
a graph $G$ in $O(n^{k+O(1)})$ time in order to construct an encoding of it into a string of length
$O(n^{k+O(1)})$. This will be done by enumerating all $k$-cliques and representing them with carefully
designed gadgets such that a triple of clique gadgets will “match well” if and only if the triple makes
a $3k$-clique together, that is, if all the edges between them exist. We will use a fast (subcubic)
CFG recognizer, RNA folder, or a Dyck Edit Distance algorithm to speed up the search for such
a “good” triple and solve $3k$-Clique faster than $O(n^{3k})$. These clique gadgets will be constructed
in similar ways in all of our proofs. The main differences will be in the combination of these cliques
into one sequence. The challenge will be to find a way to combine $O(n^k)$ gadgets into a string in a
way that a “good” triple will affect the overall score or parse-ability of the string.

Notation and Preliminaries. All graphs in this paper will be on $n$ nodes and $O(n^2)$ undirected
and unweighted edges. We associate each node with an integer in $[n]$ and let $\bar{v}$ denote the encoding
of $v$ in binary, and we will assume that it has length exactly $2\log n$ for all nodes in $V(G)$. When
a graph $G$ is clear from context, we will denote the set of all $k$-cliques of $G$ by $\mathcal{C}_k$. We will denote
concatenation of sequences with $x \circ y$, and the reverse of a sequence $x$ by $x^R$. Problem definitions
and additional problem specific preliminaries will be given in the corresponding section.

2 Clique to CFG Recognition 

In this section we show our reduction from Clique to CFG recognition and prove Theorem 1.

Given a graph $G = (V,E)$, we will construct a string $w$ of length $O(k^2 \cdot n^{k+1})$ that encodes
$G$. The string will be constructed in $O(k^2 \cdot n^{k+1})$ time, which is linear in its length. Then, we will
define our context free grammar $\mathcal{G}_C$, which will be independent of $G$ or $k$ and will be of constant
size, such that our string $w$ will be in the language defined by $\mathcal{G}_C$ if and only if $G$ contains a
$3k$-clique. This will prove Theorem 1.


Let $\Sigma = \{0, 1, \$, \#, a_{start}, a_{mid}, a_{end}, b_{start}, b_{mid}, b_{end}, c_{start}, c_{mid}, c_{end}\}$ be our set of 13 terminals
(alphabet). As usual, $\varepsilon$ will denote the empty string. We will denote the derivation rules of a
context free grammar with $\to$ and the derivation (by applying one or multiple rules) with $\Rightarrow$.


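The string. First, we will define node and list gadgets:

$$NG(v) = \#\, \bar{v}\, \# \quad\text{and}\quad LG(v) = \#\, \bigcirc_{u \in N(v)} (\$\, \bar{u}^R\, \$)\, \#$$

Consider some $t = \{v_1, \ldots, v_k\} \in \mathcal{C}_k$. We now define “clique node” and “clique list” gadgets.

$$CNG(t) = \bigcirc_{v \in t} (NG(v))^k \quad\text{and}\quad CLG(t) = \Big(\bigcirc_{v \in t} LG(v)\Big)^k$$

and our main clique gadgets will be:

$$CG_\alpha(t) = a_{start}\; CNG(t)\; a_{mid}\; CNG(t)\; a_{end}$$
$$CG_\beta(t) = b_{start}\; CLG(t)\; b_{mid}\; CNG(t)\; b_{end}$$
$$CG_\gamma(t) = c_{start}\; CLG(t)\; c_{mid}\; CLG(t)\; c_{end}$$

Finally, our encoding of a graph into a sequence is the following:

$$w = \Big(\bigcirc_{t \in \mathcal{C}_k} CG_\alpha(t)\Big) \Big(\bigcirc_{t \in \mathcal{C}_k} CG_\beta(t)\Big) \Big(\bigcirc_{t \in \mathcal{C}_k} CG_\gamma(t)\Big)$$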
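The encoding can be sketched directly from the gadget definitions: NG and LG for nodes, CNG and CLG for $k$-cliques, and the three clique gadgets concatenated into $w$. The code below is an illustration under assumptions made here: nodes are numbered `0..n-1` rather than taken from $[n]$, each terminal is one list element (so `"a_start"` is a single symbol), and the fixed width `2 * ceil(log2(n))` stands in for the encoding length $2\log n$.

```python
from itertools import combinations
from math import ceil, log2

def encode_graph(n, edges, k):
    """Return the string w (as a list of terminal symbols) encoding the graph."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    bits = 2 * max(1, ceil(log2(n)))  # fixed-width binary encoding of a node

    def enc(v):
        return list(format(v, "0{}b".format(bits)))

    def NG(v):  # node gadget: # v #
        return ["#"] + enc(v) + ["#"]

    def LG(v):  # list gadget: # ($ u^R $ for each neighbor u) #
        out = ["#"]
        for u in sorted(adj[v]):
            out += ["$"] + enc(u)[::-1] + ["$"]
        return out + ["#"]

    def CNG(t):  # each node gadget repeated k times in a row
        return [s for v in t for _ in range(k) for s in NG(v)]

    def CLG(t):  # the whole sequence of list gadgets repeated k times
        return [s for _ in range(k) for v in t for s in LG(v)]

    def CG(prefix, first, second, t):
        return ([prefix + "_start"] + first(t) + [prefix + "_mid"]
                + second(t) + [prefix + "_end"])

    cliques = [t for t in combinations(range(n), k)
               if all(b in adj[a] for a, b in combinations(t, 2))]
    w = []
    for t in cliques:
        w += CG("a", CNG, CNG, t)
    for t in cliques:
        w += CG("b", CLG, CNG, t)
    for t in cliques:
        w += CG("c", CLG, CLG, t)
    return w
```

Note the different repetition patterns: CNG repeats each node gadget $k$ times consecutively, while CLG repeats the whole list sequence $k$ times. Reading CNG from left to right and CLG from right to left therefore visits every pair of nodes.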


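The Clique Detecting Context Free Grammar. The set of non-terminals in our grammar
$\mathcal{G}_C$ is:

$$T = \{S, W, W', V, S_{\alpha\gamma}, S_{\alpha\beta}, S_{\beta\gamma}, S^*_{\alpha\gamma}, S^*_{\alpha\beta}, S^*_{\beta\gamma}, N_{\alpha\gamma}, N_{\alpha\beta}, N_{\beta\gamma}\}.$$

The “main” rules are:

$$S \to W\; a_{start}\; S_{\alpha\gamma}\; c_{end}\; W$$
$$S^*_{\alpha\gamma} \to a_{mid}\; S_{\alpha\beta}\; b_{mid}\; S_{\beta\gamma}\; c_{mid}$$
$$S^*_{\alpha\beta} \to a_{end}\; W\; b_{start}$$
$$S^*_{\beta\gamma} \to b_{end}\; W\; c_{start}$$

And for every $xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}$ we will have the following rules in our grammar. These rules
will be referred to as “listing” rules.

$$S_{xy} \to S^*_{xy}$$
$$S_{xy} \to \#\; N_{xy}\; \$\; V\; \#$$
$$N_{xy} \to \#\; S_{xy}\; \#\; V\; \$$$
$$N_{xy} \to \sigma\; N_{xy}\; \sigma \qquad \forall \sigma \in \{0,1\}$$

Then we also add “assisting” rules:

$$W \to \varepsilon \qquad\qquad W \to \sigma\, W \quad \forall \sigma \in \Sigma$$
$$W' \to \varepsilon \qquad\qquad W' \to \sigma\, W' \quad \forall \sigma \in \{0,1\}$$
$$V \to \varepsilon \qquad\qquad V \to \$\; W'\; \$\; V$$

Our Clique Detecting grammar $\mathcal{G}_C$ has 13 non-terminals $T$, 13 terminals $\Sigma$, and 38 derivation
rules. The size of $\mathcal{G}_C$, i.e. the sum of the lengths of the derivation rules, is 132.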
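The stated counts of terminals, non-terminals and rules can be checked mechanically by writing the grammar down as data. The sketch below does this under assumptions made here: rules are `(left-hand side, right-hand side list)` pairs, the Greek subscripts are abbreviated as `ab`, `ag`, `bg`, an empty list stands for $\varepsilon$, and the rule set is the reading of the grammar in which the main rules start from $S$, $S^*_{\alpha\gamma}$, $S^*_{\alpha\beta}$, $S^*_{\beta\gamma}$.

```python
# The 13 terminals: 0, 1, $, # and the nine start/mid/end markers.
TERMINALS = ["0", "1", "$", "#"] + [p + s for p in "abc"
                                    for s in ("_start", "_mid", "_end")]

def build_grammar():
    """Return the derivation rules as (lhs, rhs) pairs; rhs == [] means epsilon."""
    rules = [  # "main" rules
        ("S", ["W", "a_start", "S_ag", "c_end", "W"]),
        ("S*_ag", ["a_mid", "S_ab", "b_mid", "S_bg", "c_mid"]),
        ("S*_ab", ["a_end", "W", "b_start"]),
        ("S*_bg", ["b_end", "W", "c_start"]),
    ]
    for xy in ("ab", "ag", "bg"):  # "listing" rules
        S, S_star, N = "S_" + xy, "S*_" + xy, "N_" + xy
        rules += [
            (S, [S_star]),
            (S, ["#", N, "$", "V", "#"]),
            (N, ["#", S, "#", "V", "$"]),
            (N, ["0", N, "0"]),
            (N, ["1", N, "1"]),
        ]
    # "assisting" rules
    rules += [("W", [])] + [("W", [t, "W"]) for t in TERMINALS]
    rules += [("W'", [])] + [("W'", [b, "W'"]) for b in "01"]
    rules += [("V", []), ("V", ["$", "W'", "$", "V"])]
    return rules
```

Counting the entries of `build_grammar()` and its distinct left-hand sides reproduces the 38 rules and 13 non-terminals stated in the text.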
The proof. This proof is essentially by following the derivations of the CFG, starting from the
starting symbol $S$ and ending at some string of terminals, and showing that the resulting string
must have certain properties. Any encoding of a graph into a string as we describe will have these
properties iff the graph contains a $3k$-clique. The correctness of the reduction will follow from the
following two claims.

Claim 1. If $\mathcal{G}_C \Rightarrow w$ then $G$ contains a $3k$-clique.

Proof. The derivation of $w$ must look as follows. First we must apply the only starting rule,

$$S \Rightarrow w_1\; a_{start}\; S_{\alpha\gamma}\; c_{end}\; w_2$$

where $a_{start}$ appears in $CG_\alpha(t_\alpha)$ for some $t_\alpha \in \mathcal{C}_k$ and $c_{end}$ appears in $CG_\gamma(t_\gamma)$ for some $t_\gamma \in \mathcal{C}_k$,
and $w_1$ is the prefix of $w$ before $CG_\alpha(t_\alpha)$ and $w_2$ is the suffix of $w$ after $CG_\gamma(t_\gamma)$. Then we can get,

$$S_{\alpha\gamma} \Rightarrow CNG(t_\alpha)\; S^*_{\alpha\gamma}\; CLG(t_\gamma)$$

by repeatedly applying the $xy$-“listing” rules where $xy = \alpha\gamma$ and finally terminating with the rule
$S_{\alpha\gamma} \to S^*_{\alpha\gamma}$. By Lemma 1 below, this derivation is only possible if the nodes of $t_\alpha \cup t_\gamma$ make a
$2k$-clique (call this observation (*)). Then we have to apply the derivation:

$$S^*_{\alpha\gamma} \to a_{mid}\; S_{\alpha\beta}\; b_{mid}\; S_{\beta\gamma}\; c_{mid}$$

and for some $t_\beta \in \mathcal{C}_k$ we will have,

$$S_{\alpha\beta} \Rightarrow CNG(t_\alpha)\; S^*_{\alpha\beta}\; CLG(t_\beta), \quad\text{and}\quad S_{\beta\gamma} \Rightarrow CNG(t_\beta)\; S^*_{\beta\gamma}\; CLG(t_\gamma),$$

where in both derivations we repeatedly use “listing” rules before exiting with the $S_{xy} \to S^*_{xy}$ rule.
Again, by Lemma 1 we get that the nodes of $t_\alpha \cup t_\beta$ are a $2k$-clique in $G$, and that the nodes of
$t_\beta \cup t_\gamma$ are a $2k$-clique as well (call this observation (**)). Finally, we will get the rest of $w$ using
the derivations:

$$S^*_{\alpha\beta} \Rightarrow a_{end}\; w_3\; b_{start}, \quad\text{and}\quad S^*_{\beta\gamma} \Rightarrow b_{end}\; w_4\; c_{start},$$

where $w_3$ is the substring of $w$ between $CG_\alpha(t_\alpha)$ and $CG_\beta(t_\beta)$, and similarly $w_4$ is the substring of $w$
between $CG_\beta(t_\beta)$ and $CG_\gamma(t_\gamma)$.

Combining observations (*) and (**), which we got from the above derivation scheme and
Lemma 1, we conclude that the nodes of $t_\alpha \cup t_\beta \cup t_\gamma$ form a $3k$-clique in $G$, and we are done.

To complete the proof we will now prove Lemma 1.

Lemma 1. If for some $t, t' \in \mathcal{C}_k$ and $xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}$ we can get the derivation
$S_{xy} \Rightarrow CNG(t)\; S^*_{xy}\; CLG(t')$, only using the “listing” rules, then $t \cup t'$ forms a $2k$-clique in $G$.

Proof. Any sequence of derivations starting at $S_{xy}$ and ending at $S^*_{xy}$ will have the following form.
From $S_{xy}$ we can only proceed to non-terminals $V$ and $N_{xy}$. The non-terminal $V$ does not produce
any other non-terminals. A single instantiation of $S_{xy}$ can only produce (via the second “listing”
rule) a single instantiation of $N_{xy}$. In turn, from $N_{xy}$ we can only proceed to non-terminals $V$ and
$S_{xy}$, and again, a single instantiation of $N_{xy}$ can produce a single instantiation of $S_{xy}$. Thus, we
produce some terminals (on the left and on the right) from the set $\{\#, \$, 0, 1\}$ and then we arrive
at $S_{xy}$ again. This can repeat an arbitrary number of times, until we apply the rule $S_{xy} \to S^*_{xy}$.

Thus, the derivations must look like this:

$$S_{xy} \Rightarrow \ell\; S^*_{xy}\; r$$

for some strings $\ell, r$ of the form $\{\#, \$, 0, 1\}^*$, and our goal is to prove that $\ell, r$ satisfy certain
properties.

It is easy to check that $V$ can derive strings of the following form $p = (\$\, \{0,1\}^*\, \$)^*$, that is,
it produces a list (possibly of length 0) of binary sequences (possibly of length 0) surrounded by
\$ symbols (between every two neighboring binary sequences there are two \$). A key observation
is that repeated application of the fourth “listing” rule gives derivations $N_{xy} \Rightarrow s\; N_{xy}\; s^R$, for
any $s \in \{0,1\}^*$. Combining these last two observations, we see that when starting with $S_{xy}$ we can
only derive strings of the following form, or terminate via the rule $S_{xy} \to S^*_{xy}$.

$$S_{xy} \Rightarrow \#\; N_{xy}\; \$\; p_1\; \# \;\Rightarrow\; \#\; s\; N_{xy}\; s^R\; \$\; p_1\; \# \;\Rightarrow\; \#\; s\; \#\; S_{xy}\; \#\; p_2\; \$\; s^R\; \$\; p_1\; \# \qquad (1)$$

for some $p_1, p_2$ of the form $(\$\, \{0,1\}^*\, \$)^*$.

Now consider the assumption in the statement of the lemma, and recall our constructions of
“clique node gadget” and “clique list gadget”. By construction, $CNG(t)$ is composed of $k^2$ node
gadgets (NG) separated by \# symbols, and CLG is composed of $k^2$ list gadgets (LG) separated by
\# symbols. Note also that the list gadgets contain $O(n)$ node gadgets within them and those are
separated by \$ symbols, and there are no \# symbols within the list gadgets.

For every $i \in [k^2]$, let $\ell_i$ be the $i$-th NG in $CNG(t)$ and let $r_i$ be the $i$-th LG in $CLG(t')$. Then,
for every $1 \le i \le k^2$, in the derivation $S_{xy} \Rightarrow CNG(t)\; S^*_{xy}\; CLG(t')$, we must have had the
derivation

$$S_{xy} \Rightarrow \ell_1 \cdots \ell_{i-1}\, (S_{xy})\, r_{k^2-i+2} \cdots r_{k^2} \;\Rightarrow\; \ell_1 \cdots \ell_{i-1}\, (\ell_i\; S_{xy}\; r_{k^2-i+1})\, r_{k^2-i+2} \cdots r_{k^2}$$

and by (1) this implies that the binary encoding of the node $v \in t$ that appears in the $i$-th NG
in $CNG(t)$ must appear in one of the NGs that appear in the $(k^2-i+1)$-th LG in $CLG(t')$, which
corresponds to a node $u \in t'$. Since $LG(u)$ contains a list of neighbors of the node $u$, this implies
that $v \in N(u)$ and $\{u, v\} \in E$. Also note that $u \notin N(u)$ and therefore $u$ does not appear in $LG(u)$,
and therefore $v$ cannot be equal to $u$ if this derivation occurs.

Now, consider any pair of nodes v E t,u E t'. By the construction of CNG and CLG, we must 
have an index i E [A; 2 ] such that the i th NG in CNG(t ) is NG(v ) and the (k 2 — i + 1) LG in 

CLG(t') is CLG(u). By the previous argument, we must have that u / v and {u,v} E E is an 

edge. Given that t, t' are fc-cliques themselves, and any pair of nodes v E t, u E t' must be neighbors 
(and therefore different), we conclude that t U t' is a 2/c-clique. □ 

□ 


Claim 2. If G contains a 3k-clique, then \(\mathcal{G}_C \Rightarrow w\).

Proof. This claim follows by tracing the derivations in the proof of Claim 1 with any triple
\(t_\alpha, t_\beta, t_\gamma \in C_k\) of k-cliques that together form a 3k-clique. □

We are now ready to prove Theorem 1.

Reminder of Theorem 1. There is a context-free grammar \(\mathcal{G}_C\) of constant size such that if we
can determine if a string of length n can be obtained from \(\mathcal{G}_C\) in \(T(n)\) time, then k-Clique on
n node graphs can be solved in \(O(T(n^{k/3+1}))\) time, for any \(k \ge 3\). Moreover, the reduction is
combinatorial.

Proof. Given an instance of 3k-Clique, a graph \(G = (V, E)\), we construct the string w as described
above, which will have length \(O(k^2 \cdot n^{k+1})\), in \(O(k^2 \cdot n^{k+1})\) time. Given a recognizer for \(\mathcal{G}_C\) as in
the statement, we can check whether \(\mathcal{G}_C \Rightarrow w\) in \(O(T(n^{k/3+1}))\) time (treating k as a constant).
By Claims 1 and 2, \(\mathcal{G}_C \Rightarrow w\) iff the graph G contains a 3k-clique. □

3 Clique to RNA folding 

In this section we prove Theorem 2 by reducing k-Clique to RNA folding, defined below.

Let \(\Sigma\) be an alphabet of letters of constant size. For any letter \(\sigma \in \Sigma\) there will be exactly one
“matching” letter which will be denoted by \(\sigma'\). Let \(\Sigma' = \{\sigma' \mid \sigma \in \Sigma\}\) be the set of matching letters
to the letters in \(\Sigma\). Throughout this section we will say that a pair of letters \(\{x, y\}\) match iff \(y = x'\)
or \(x = y'\).

Two pairs of indices \((i_1, j_1), (i_2, j_2)\) such that \(i_1 < j_1\) and \(i_2 < j_2\) are said to “cross” iff at
least one of the following three conditions holds: (i) \(i_1 = i_2\), or \(i_1 = j_2\), or \(j_1 = i_2\), or \(j_1 = j_2\); (ii)
\(i_1 < i_2 < j_1 < j_2\); (iii) \(i_2 < i_1 < j_2 < j_1\). Note that by our definition, non-crossing pairs cannot
share any indices.
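The three crossing conditions translate directly into a predicate. The following is a minimal sketch; the function names `cross` and `is_non_crossing` are ours and are introduced only for illustration.

```python
def cross(p, q):
    """Return True if the index pairs p = (i1, j1) and q = (i2, j2) cross.

    Assumes i1 < j1 and i2 < j2, as in the definition above.
    """
    (i1, j1), (i2, j2) = p, q
    shares_index = i1 == i2 or i1 == j2 or j1 == i2 or j1 == j2  # condition (i)
    interleave_a = i1 < i2 < j1 < j2                              # condition (ii)
    interleave_b = i2 < i1 < j2 < j1                              # condition (iii)
    return shares_index or interleave_a or interleave_b


def is_non_crossing(pairs):
    """Check that no two distinct pairs in the collection cross."""
    pairs = list(pairs)
    return not any(cross(pairs[a], pairs[b])
                   for a in range(len(pairs))
                   for b in range(a + 1, len(pairs)))
```

Nested pairs such as (1, 4) and (2, 3), and disjoint pairs such as (1, 2) and (3, 4), are the only configurations that do not cross.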

Definition 1 (RNA Folding). Given a sequence S of n letters from \(\Sigma \cup \Sigma'\), what is the maximum
number of pairs \(A = \{(i,j) \mid i < j \text{ and } i,j \in [n]\}\) such that for every pair \((i,j) \in A\) the letters
\(S[i]\) and \(S[j]\) match and there are no crossing pairs in A? We will denote this maximum value by
\(RNA(S)\).
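For concreteness, the quantity in Definition 1 can be computed by a standard cubic-time interval dynamic program. The sketch below is only an illustration and is not part of the reduction; it assumes letters are encoded as strings, with the matching letter of `"a"` written `"a'"` (an encoding chosen here for convenience).

```python
def matches(x, y):
    """Letters match iff one is the primed version of the other."""
    return x == y + "'" or y == x + "'"


def rna(seq):
    """Maximum number of non-crossing matching pairs in seq (a list of letters)."""
    n = len(seq)
    # best[i][j] is the optimum for the half-open window seq[i:j].
    best = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            score = best[i + 1][j]                 # seq[i] stays unpaired
            for m in range(i + 1, j):              # seq[i] is paired with seq[m]
                if matches(seq[i], seq[m]):
                    score = max(score, 1 + best[i + 1][m] + best[m + 1][j])
            best[i][j] = score
    return best[0][n]
```

Pairing position i with position m splits the window into an inside part and an outside part, which is exactly what the non-crossing condition permits.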

It is interesting to note that RNA can be seen as a language distance problem with respect 
to some easy to parse grammar. Because of the specific structure of this grammar, our reduction 
from Section 2 does not apply. However, the ideas we introduced allow us to replace our clique
detecting grammar with an easier grammar if we ask the parser to return more information, like 
the distance to a string in the grammar. At a high level, this is how we get the reduction to RNA 
folding presented in this section. 

To significantly simplify our proofs, we will reduce k-Clique to a more general weighted version
of RNA folding. Below we show that this version can be reduced to the standard RNA folding 
problem with a certain overhead. 

Definition 2 (Weighted RNA Folding). Given a sequence S of n letters from \(\Sigma \cup \Sigma'\) and a weight
function \(w : \Sigma \to [M]\), what is the maximum weight of a set of pairs \(A = \{(i,j) \mid i < j \text{ and } i,j \in [n]\}\)
such that for every pair \((i,j) \in A\) the letters \(S[i]\) and \(S[j]\) match and there are no crossing
pairs in A? The weight of A is defined as \(\sum_{(i,j) \in A} w(S[i])\). We will denote this maximum value by
\(WRNA(S)\).

Lemma 2. An instance S of Weighted RNA Folding on a sequence of length n, alphabet \(\Sigma \cup \Sigma'\),
and weight function \(w : \Sigma \to [M]\) can be reduced to an instance \(\hat{S}\) of RNA Folding on a sequence
of length \(O(Mn)\) over the same alphabet.

Proof. Let \(S = S_1 \cdots S_n\); we set \(\hat{S} := S_1^{w(S_1)} \cdots S_n^{w(S_n)}\), that is, each symbol \(S_i\) is repeated \(w(S_i)\)
times. First, we can check that \(WRNA(S) \le RNA(\hat{S})\). This holds because we can replace each
matching pair \(\{a, a'\}\) in the folding achieving weighted RNA score of \(WRNA(S)\) with \(w(a)\) such
pairs in the (unweighted) RNA folding instance \(\hat{S}\), giving the same contribution to \(RNA(\hat{S})\).

Now we will show that \(RNA(\hat{S}) \le WRNA(S)\). Suppose that there are symbols \(a\) and \(a'\) in \(\hat{S}\)
that are paired. The symbol \(a\) comes from a sequence \(s_1\) of \(a\) symbols of length \(w(a)\). The sequence
\(s_1\) was produced from a single symbol \(a\) when transforming S into \(\hat{S}\). Similarly, the symbol \(a'\)
comes from a sequence \(s_2\) of \(a'\) symbols of length \(w(a)\). Also, assume that there exists a symbol in
\(s_1\) that is paired to a symbol that is outside of \(s_2\) or there exists a symbol in \(s_2\) that is paired to a
symbol outside of \(s_1\). While we can find such symbols \(a\) and \(a'\), we repeat the following procedure.
Choose \(a\) and \(a'\) that satisfy the above properties, and choose them so that the number of other
symbols between \(a\) and \(a'\) is as small as possible. Break ties arbitrarily. We match all symbols in
\(s_1\) to their counterparts in \(s_2\). Also, we rematch all symbols that were previously matched to \(s_1\)
or \(s_2\) among themselves. We can check that we can rematch these symbols so that the number of
matched pairs does not decrease.

Therefore, we can assume that in some optimal folding of \(\hat{S}\), for any pair \(a \in \Sigma\), \(a' \in \Sigma'\) that is
matched the corresponding substrings \(s_1\) and \(s_2\) are completely paired up. Thus, to get a folding
of S that achieves \(WRNA(S)\) at least \(RNA(\hat{S})\) we can now simply fold the symbols corresponding
to \(s_1\) and \(s_2\), for any such pair \(a, a'\). □
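The transformation in Lemma 2 is easy to try out on a toy instance. The sketch below assumes letters are strings with the matching letter of `"a"` written `"a'"`, and that the weight function is a dictionary over unprimed letters; `wrna`, `expand`, and the example sequence are illustrative names and data, not taken from the paper.

```python
def base(x):
    """Strip the prime, so that weights can be looked up for a and a' alike."""
    return x[:-1] if x.endswith("'") else x


def matches(x, y):
    """Letters match iff one is the primed version of the other."""
    return x == y + "'" or y == x + "'"


def wrna(seq, w):
    """Weighted folding optimum; a matched pair (a, a') contributes w[a]."""
    n = len(seq)
    best = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            score = best[i + 1][j]                 # seq[i] stays unpaired
            for m in range(i + 1, j):              # seq[i] is paired with seq[m]
                if matches(seq[i], seq[m]):
                    gain = w[base(seq[i])]
                    score = max(score, gain + best[i + 1][m] + best[m + 1][j])
            best[i][j] = score
    return best[0][n]


def expand(seq, w):
    """Repeat every letter w(letter) times, as in the proof of Lemma 2."""
    return [x for x in seq for _ in range(w[base(x)])]


w = {"a": 3, "b": 1}
unit = {"a": 1, "b": 1}                            # unweighted folding = all weights 1
S = ["a", "b", "a'", "b'", "a", "a'"]
S_hat = expand(S, w)
```

On this instance the weighted score of `S` and the unweighted score of `S_hat` coincide, as the lemma predicts, while the length grows from 6 to 14.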

3.1 The Reduction 

Given a graph \(G = (V, E)\) on n nodes and \(O(n^2)\) unweighted undirected edges, we will describe
how to efficiently construct a sequence \(S_G\) over an alphabet \(\Sigma\) of constant size, such that the RNA
score of \(S_G\) will depend on whether G contains a 3k-clique. The length of \(S_G\) will be \(O(k^d n^{k+c})\)
for some small fixed constants \(c, d > 0\) independent of n and k, and the time to construct it from
G will be linear in its length. This will prove that a fast (e.g. subcubic) RNA folder can be used
as a fast 3k-clique detector (one that runs much faster than in \(O(n^{3k})\) time).

Our main strategy will be to enumerate all k-cliques in the graph and then search for a triple of
k-cliques that have all the edges between them. We will be able to find such a triple iff the graph
contains a 3k-clique. An RNA folder will be utilized to speed up the search for such a “good”
triple. Our reduction will encode every k-clique of G using a “short” sequence of length \(O(n^c)\) such
that the RNA folding score of a sequence composed of the encodings of a triple of k-cliques will
be large iff the triple is “good”. Then, we will show how to combine the short encodings into our
long sequence \(S_G\) such that the existence of a “good” triple affects the overall score of an optimal
folding.
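The search that the RNA folder is meant to accelerate can be written down naively. The following brute-force sketch is for illustration only: it takes time roughly cubic in the number of k-cliques, which is exactly the cost the reduction aims to beat. The helper names and the adjacency-dictionary representation are our assumptions.

```python
from itertools import combinations


def make_adj(nodes, edges):
    """Adjacency sets of an undirected graph without self-loops."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def k_cliques(nodes, adj, k):
    """All k-subsets of nodes that are pairwise adjacent."""
    return [c for c in combinations(sorted(nodes), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]


def fully_connected(t1, t2, adj):
    """True iff every node of t1 is adjacent to every node of t2."""
    return all(v in adj[u] for u in t1 for v in t2)


def has_good_triple(nodes, adj, k):
    """Search for three k-cliques that have all the edges between them."""
    cliques = k_cliques(nodes, adj, k)
    return any(fully_connected(a, b, adj)
               and fully_connected(a, c, adj)
               and fully_connected(b, c, adj)
               for a in cliques for b in cliques for c in cliques)
```

Because no node is adjacent to itself, three k-cliques that are fully connected to each other are automatically disjoint, so a “good” triple is the same thing as a 3k-clique.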

The RNA Sequence. Our sequence \(S_G\) will be composed of many smaller gadgets which will be
combined in certain ways by other padding gadgets. We construct these gadgets now and explain
their useful properties. The proofs of these properties are postponed until after we present the
whole construction of \(S_G\).

For a sequence \(s \in \Sigma^{*}\) let \(p(s) \in (\Sigma')^{*}\) be the sequence obtained from s by replacing every letter
\(\sigma \in \Sigma\) with the matching letter \(\sigma' \in \Sigma'\). That is, if \(s = s_1 \cdots s_n\) then \(p(s) = s'_1 \cdots s'_n\).

Our alphabet \(\Sigma\) will contain the letters 0, 1 and some additional symbols which we will add as
needed in our gadgets. We will set the weights so that \(w(0) = w(1) = 1\), and the extra symbols
we add will be more “expensive”. For example, we will add the \(\$\) symbol to the alphabet and set
\(w(\$) = 10 \cdot \log n\). Writing \(\bar{v}\) for the binary encoding of a node v, we define node gadgets as,

\[ NG(v) = \$^{2n} \; \bar{v} \; \$^{2n} \]

and list or neighborhood gadgets as,

\[ LG(v) = \bigcirc_{u \in N(v)} (\$ \; \bar{u} \; \$) \;\circ\; \bigcirc_{u \notin N(v)} (\$ \; \$). \]

These gadgets are constructed so that for any two nodes \(u, v \in V(G)\), the RNA folding score of
the sequence \(NG(v) \circ p(LG(u))^{R}\) is large (equal to some fixed value \(E_1\)) if v is in the neighborhood
of u, that is \((u,v) \in E(G)\), and smaller otherwise (at most \(E_1 - 1\)). This is because the more
expensive \(\$\) symbols force an optimal folding to match \(\bar{v}\) with exactly one \(p(\bar{w})^{R}\), since otherwise a
\(\$'\) will remain unmatched while many \(\$\) are free. The construction also allows the folding to pick
any \(w \in N(u)\) to fold together with v, without affecting the score from the \(\$\) symbols. Then, we
use the fact that \(\bar{v} \circ p(\bar{w})^{R}\) achieves maximal score iff \(v = w\). This is formally proved in Claim 3.
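The gap between neighbors and non-neighbors can be observed on a tiny graph. The sketch below relies on an observation that is ours rather than the paper's: since \(NG(v)\) contains only unprimed letters and \(p(LG(u))^{R}\) only primed ones, every pair in a folding joins the two halves, and a non-crossing set of such pairs is a common subsequence of \(NG(v)\) and \(LG(u)\). The weighted folding score is therefore computed here as a maximum-weight common subsequence. The 4-node graph, the fixed-width encoding via `format`, and all function names are assumptions made for illustration.

```python
import math


def encode(v, bits):
    """Fixed-width binary encoding of node v (an assumed concrete format)."""
    return list(format(v, "0{}b".format(bits)))


def node_gadget(v, n, bits):
    """NG(v): 2n dollar signs, the encoding of v, 2n dollar signs."""
    return ["$"] * (2 * n) + encode(v, bits) + ["$"] * (2 * n)


def list_gadget(u, n, adj, bits):
    """LG(u): one '$ enc(x) $' block per neighbor, '$ $' per non-neighbor."""
    out = []
    for x in sorted(adj[u]):
        out += ["$"] + encode(x, bits) + ["$"]
    out += ["$", "$"] * (n - len(adj[u]))
    return out


def weighted_lcs(a, b, w):
    """Maximum total weight of a common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            if a[i - 1] == b[j - 1]:
                dp[i][j] = max(dp[i][j], dp[i - 1][j - 1] + w[a[i - 1]])
    return dp[len(a)][len(b)]


n = 4
adj = {1: {2, 3}, 2: {1}, 3: {1}, 4: set()}       # toy graph: edges 1-2 and 1-3
bits = 2 * int(math.log2(n))                      # encodings of length 2 log n
w = {"0": 1, "1": 1, "$": 10 * int(math.log2(n))}

score_neighbor = weighted_lcs(node_gadget(2, n, bits), list_gadget(1, n, adj, bits), w)
score_stranger = weighted_lcs(node_gadget(4, n, bits), list_gadget(1, n, adj, bits), w)
```

In this toy setting node 2 is a neighbor of node 1 and reaches the full score (all \(2n\) primed dollar signs plus the whole encoding), while node 4 is not a neighbor and falls strictly short.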

Let \(\ell_1 = 10 k^2 \cdot n \log n\) and note that \(\ell_1\) is an upper bound on the total weight of all the symbols
in the gadgets \(NG(v)\) and \(LG(v)\), for any node \(v \in V(G)\). Let \(C_k\) be the set of k-cliques in G and
consider some \(t = \{v_1, \ldots, v_k\} \in C_k\). We will now combine the node and list gadgets into larger
gadgets that will be encoding k-cliques.

We will add the \(\#\) symbol to the alphabet and set \(w(\#) = \ell_1\), i.e. a single \(\#\) letter is more
expensive than the entire \(k^2\) node or list gadgets. We will encode a clique in two ways. The first one is,

\[ CNG(t) = \bigcirc_{v \in t} (\# \, NG(v) \, \#)^{k} \]

and the second one is,

\[ CLG(t) = \Bigl( \bigcirc_{v \in t} (\# \, LG(v) \, \#) \Bigr)^{k}. \]

These clique gadgets are very useful because of the following property. For any two k-cliques
\(t_1, t_2 \in C_k\), the RNA folding score of the sequence \(CNG(t_1) \circ p(CLG(t_2))^{R}\) is large (equal to some
fixed value \(E_2\)) if \(t_1\) and \(t_2\), together, form a 2k-clique, and is smaller otherwise (at most \(E_2 - 1\)).
That is, the RNA folding score of the sequence tells us whether every pair of nodes \(u \in t_1, v \in t_2\)
is connected, \((u,v) \in E(G)\). There are two ideas in the construction of these gadgets. First, we
copy the gadgets corresponding to the k nodes of the cliques k times, resulting in \(k^2\) gadgets, and
we order them in a way so that for any pair of nodes \(u \in t_1, v \in t_2\) there will be a position i such
that the gadget of u in \(CNG(t_1)\) and the gadget of v in \(p(CLG(t_2))^{R}\) are both at position i. Then,
we use the expensive \(\#\) separators to make sure that in an optimal RNA folding of \(CNG(t_1)\) and
\(p(CLG(t_2))^{R}\), the gadgets at positions i are folded together, and not to other gadgets; otherwise
some \(\#\) symbol will not be paired. This is formally proved in Claim 4.
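The ordering trick can be checked mechanically. In the sketch below (function names are ours), the node order of \(CNG(t)\) repeats each node k times in a row, while the node order of \(CLG(t)\) repeats the whole list k times. Lining the two orders up position by position then covers every pair of nodes; this holds whether positions are compared in the same direction or in reverse, so the sketch does not depend on which of the two conventions is used.

```python
def cng_order(t):
    """Node order inside CNG(t): each node's gadget repeated k times in a row."""
    k = len(t)
    return [v for v in t for _ in range(k)]


def clg_order(t):
    """Node order inside CLG(t): the whole list of k gadgets repeated k times."""
    k = len(t)
    return [v for _ in range(k) for v in t]


def facing_pairs(t1, t2, reverse=False):
    """Pairs (u, v) whose gadgets end up at the same position."""
    other = clg_order(t2)
    if reverse:
        other = other[::-1]
    return set(zip(cng_order(t1), other))
```

For k = 3 the two orders are `a a a b b b c c c` and `x y z x y z x y z`, and the nine positions realize all nine pairs.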

Let \(\ell_2 = 10 \cdot k^2 \cdot \ell_1 = O(n \log n)\) and note that it is an upper bound on the total weight of
all the symbols in the \(CNG(t)\) and \(CLG(t)\) gadgets. Finally, we introduce a new letter \(g\) to the
alphabet and set its weight to \(w(g) = \ell_2\), which is much more expensive than the entire gadgets
we constructed before, and then define our final clique gadgets. Moreover, we will now duplicate our
alphabet three times to force only “meaningful” foldings between our gadgets. It will be convenient
to think of \(\alpha, \beta, \gamma\) as three types such that we will be looking for three k-cliques, one from type \(\alpha\), one
from \(\beta\) and one from \(\gamma\). For any pair of types \(xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}\) we will construct a new alphabet
\(\Sigma_{xy} = \{\sigma_{xy} \mid \sigma \in \Sigma\}\) in which we mark each letter with the pair of types it should be participating
in. For a sequence \(s \in (\Sigma \cup \Sigma')^{*}\) we use the notation \([s]_{xy}\) to represent the sequence in \((\Sigma_{xy} \cup \Sigma'_{xy})^{*}\)
in which we replace every letter \(\sigma\) with the letter \(\sigma_{xy}\).

We will need three types of these clique gadgets in order to force the desired interaction between
them.

\[ CG_\alpha(t) = [g \, CNG(t) \, g]_{\alpha\gamma} \circ [g' \, p(CLG(t))^{R} \, g']_{\alpha\beta} \]

\[ CG_\beta(t) = [g \, CNG(t) \, g]_{\alpha\beta} \circ [g' \, p(CLG(t))^{R} \, g']_{\beta\gamma} \]

\[ CG_\gamma(t) = [g \, CNG(t) \, g]_{\beta\gamma} \circ [g' \, p(CLG(t))^{R} \, g']_{\alpha\gamma} \]

These clique gadgets achieve exactly what we want: for any three k-cliques \(t_\alpha, t_\beta, t_\gamma \in C_k\) the
RNA folding score of the sequence \(CG_\alpha(t_\alpha) \circ CG_\beta(t_\beta) \circ CG_\gamma(t_\gamma)\) is large (equal to some value \(E_3\)) if
\(t_\alpha \cup t_\beta \cup t_\gamma\) is a 3k-clique and smaller otherwise (at most \(E_3 - 1\)). In other words, an RNA folder can
use these gadgets to determine if three separate k-cliques can form a 3k-clique. This is achieved by
noticing that the highest priority for an optimal folding would be to match the \(g_{xy}\) letters with their
counterparts \(g'_{xy}\), which leaves us with three sequences to fold: \(S_{\alpha\gamma} = [CNG(t_\alpha) \circ p(CLG(t_\gamma))^{R}]_{\alpha\gamma}\),
\(S_{\alpha\beta} = [p(CLG(t_\alpha))^{R} \circ CNG(t_\beta)]_{\alpha\beta}\) and \(S_{\beta\gamma} = [p(CLG(t_\beta))^{R} \circ CNG(t_\gamma)]_{\beta\gamma}\). The maximal
score (\(E_2\)) in each one of these three sequences can be achieved iff every pair of our k-cliques forms
a 2k-clique, which happens iff they form a 3k-clique. This is formally proved in Claim 5.

The only remaining challenge is to combine all the sequences corresponding to all the \(O(n^k)\)
k-cliques in the graph into one sequence in a way that the existence of a “good” triple, one that
makes a 3k-clique, affects the RNA folding score of the entire sequence. Note that if we naively
concatenate all the clique gadgets into one sequence, the optimal folding will choose to fold clique
gadgets in pairs instead of triples since folding a triple makes other gadgets unable to fold without
crossings. Instead, we will use the structure of the RNA folding problem again to implement a
“selection” gadget that forces exactly three clique gadgets to fold together in any optimal folding.
We remark that the implementation of such “selection” gadgets is very different in the three proofs
in this paper: in Section 2 we use the derivation rules, in this section we use the fact that even
when folding all expensive separators in a sequence to the left or right we are left with an interval
that is free to fold with other parts of the sequence, and in Section 4 we rely on the restriction of
Dyck that an opening bracket can match only to closing brackets to its right.

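Towards this end, we introduce some extremely expensive symbols \(\alpha, \beta, \gamma\). Let \(\ell_3 = 10 \ell_2\) be
an upper bound on the total weight of the \(CG_x(t)\) gadgets, and set \(w(\alpha) = w(\beta) = w(\gamma) = \ell_3\). Our
“clique detecting” RNA sequence is defined as follows.

\begin{align*}
S_G ={} & \alpha^{2n^k} \; \bigcirc_{t \in C_k} \bigl(\alpha' \, CG_\alpha(t) \, \alpha'\bigr) \; \alpha^{2n^k} \\
 \circ{} & \beta^{2n^k} \; \bigcirc_{t \in C_k} \bigl(\beta' \, CG_\beta(t) \, \beta'\bigr) \; \beta^{2n^k} \\
 \circ{} & \gamma^{2n^k} \; \bigcirc_{t \in C_k} \bigl(\gamma' \, CG_\gamma(t) \, \gamma'\bigr) \; \gamma^{2n^k}
\end{align*}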

The added padding makes sure that all but one \(CG_\alpha\) gadget are impossible to fold without
giving up an extremely valuable \(\alpha, \alpha'\) pair, and similarly all but one \(CG_\beta\) and one \(CG_\gamma\) cannot be
folded. To see this, assume all the \(\alpha'\) are paired (left or right) and note that if both \(\alpha'\) symbols
surrounding a clique gadget \(CG_\alpha(t)\) are paired to one side (say, left) then the only non-crossing
pairs that the gadget could participate in are either with \(\alpha\) symbols (but those cannot be matches)
or within itself. Our marking of symbols with pairs of types xy makes it so that a clique gadget
cannot have any matches with itself. Therefore, if all \(\alpha'\) symbols are matched, then all but one
\(CG_\alpha(t)\) gadgets do not participate in any foldings. The argument for \(\beta, \gamma\) is symmetric. We are left
with a folding of a sequence of three clique gadgets \(CG_\alpha(t_\alpha), CG_\beta(t_\beta), CG_\gamma(t_\gamma)\) which can achieve
maximal score iff \(t_\alpha \cup t_\beta \cup t_\gamma\) is a 3k-clique.

This proves our main claim that the (weighted) RNA folding score of our clique detecting
sequence \(S_G\) is large (equal to some fixed value \(E_C\)) if the graph contains a 3k-clique and smaller
(at most \(E_C - 1\)) otherwise. See Claim 6 for the formal proof.

Our final alphabet \(\Sigma\) has size 18 (together with \(\Sigma'\) this makes 36 symbols).

\[ \Sigma = \{\alpha, \beta, \gamma\} \cup \bigcup_{xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}} \{0, 1, \$, \#, g\}_{xy} \]

Observe that \(S_G\) can be constructed from G in \(O(n^{k+1})\) time, by enumerating all subsets of
k nodes, and that it has length \(O(n^{k+1})\). The construction of \(S_G\) should be seen as a heavy
preprocessing and encoding of the graph, after which we only have to work with k-cliques. The
largest weight we use in our construction is \(\ell_3 = O(k^{O(1)} n \log n)\) and therefore using Lemma 2
we can reduce the computation of the weighted RNA of \(S_G\) to an instance of (unweighted) RNA
folding on a sequence of length \(O(|S_G| \, k^{O(1)} n \log n) = O(k^{O(1)} n^{k+2})\), which proves Theorem 2.

Formal Proofs. We will start with the proof that the list and node gadgets have the desired
functionality. Let \(E_1 = (10n + 1) \cdot \log n\).

Claim 3. For any \(xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}\) and two nodes \(u, v \in V(G)\) we have that the weighted RNA
folding score \(WRNA([NG(u) \circ p(LG(v))^{R}]_{xy})\) is \(E_1\) if \(u \in N(v)\) and at most \(E_1 - 1\) otherwise.

Proof. Since all letters in the sequence we are concerned with have the same mark xy, we will omit
the subscripts. If \(u \in N(v)\) then \(p(\bar{u})^{R}\) appears in \(p(LG(v))^{R}\) and we can completely match it with
the sequence \(\bar{u}\) in \(NG(u)\), giving a score of \(2 \log n\), then we match all the \(n\) \(\$'\) symbols in \(LG(v)\)
to some \(\$\) symbols in \(NG(u)\) and gain an extra score of \(n \cdot 10 \log n\). Therefore, in this case, the
weighted RNA score is \(E_1 = \log n + 10 n \log n\).

Now we assume that \(u \notin N(v)\) and show that in the optimal folding the score is at most \(E_1 - 1\).
First, note that the sequence \(p(LG(v))^{R}\) has fewer \(\$'\) symbols (it has \(2n\) such symbols) than the
sequence \(NG(u)\) has \(\$\) symbols (it has \(4n\) such symbols). By not pairing a symbol \(\$'\) in \(p(LG(v))^{R}\), we lose a
score of \(w(\$)\) which is much more than the entire weight of the non-\(\$\) symbols in \(NG(u)\). Therefore,
any matching which leaves some \(\$'\) unmatched is clearly sub-optimal, and we can assume that all
the \(\$'\) symbols are matched. Given this, the substring \(\bar{u}\) can only be folded to at most one of
the substrings \(p(\bar{v}_i)^{R}\) in \(LG(v)\) for some \(v_i \in N(v)\). This folding can only achieve score at most \(2 \log n - 1\)
because \(u \notin N(v)\). Thus, the total score of the optimal matching is no more than \(E_1 - 1\). □

Next, we prove that the “clique node gadgets” and “clique list gadgets” check that two k-cliques
form one bigger 2k-clique. Let \(E_2 = 2k^2 \cdot \ell_1 + k^2 \cdot E_1\).

Claim 4. For any \(xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}\) and two k-cliques \(t_1, t_2 \in C_k\) we have that the weighted RNA
folding score \(WRNA([CNG(t_1) \circ p(CLG(t_2))^{R}]_{xy})\) is \(E_2\) if \(t_1 \cup t_2\) is a 2k-clique and at most \(E_2 - 1\)
otherwise.

Proof. We will omit the irrelevant xy subscripts. First, note that the sequences \(CNG(t_1)\) and
\(p(CLG(t_2))^{R}\) have the same number of \(\#\) and \(\#'\) symbols, respectively. By not pairing a single
one of them with its counterpart, we lose a contribution of \(w(\#)\) to the WRNA score, which is
much more than we could gain by pairing all the symbols in all the node and list gadgets (that is,
the rest of the sequence). Therefore, we assume that all the \(\#\) and \(\#'\) symbols are paired. Let
\(t_1 = \{u_1, \ldots, u_k\}\) and \(t_2 = \{v_1, \ldots, v_k\}\). We can now say that

\[ WRNA(CNG(t_1) \circ p(CLG(t_2))^{R}) = (2k^2) \, w(\#) + \sum_{i \in [k]} \sum_{j \in [k]} WRNA(NG(u_i) \circ p(LG(v_j))^{R}), \]

and by Claim 3 we know that \(WRNA(NG(u_i) \circ p(LG(v_j))^{R}) = E_1\) if \(u_i\) and \(v_j\) are connected and
less otherwise. Therefore, we get the maximal \(E_2 = 2k^2 \cdot \ell_1 + k^2 \cdot E_1\) iff every pair of
nodes, one from \(t_1\) and one from \(t_2\), is connected. Since \(u \notin N(u)\) for all \(u \in V(G)\) and since \(t_1, t_2\)
are k-cliques, we conclude that \(t_1 \cup t_2\) is a 2k-clique. □

We are now ready to prove the main property of our clique gadgets: a sequence of three
clique gadgets (one from each type) achieves maximal score iff they form a 3k-clique together. Let
\(E_3 = 6\ell_2 + 3E_2\).

Claim 5. For any \(xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}\) and three k-cliques \(t_\alpha, t_\beta, t_\gamma \in C_k\) we have that the weighted
RNA folding score \(WRNA(CG_\alpha(t_\alpha) \circ CG_\beta(t_\beta) \circ CG_\gamma(t_\gamma))\) is \(E_3\) if \(t_\alpha \cup t_\beta \cup t_\gamma\) is a 3k-clique and
at most \(E_3 - 1\) otherwise.

Proof. If for some \(xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}\) there is a symbol \(g_{xy}\) which is not paired up with its
counterpart, we lose a contribution to the WRNA score that is more than we could get by pairing up
all symbols that are not \(g_{xy}\). Therefore, we have the equality

\begin{align*}
WRNA(CG_\alpha(t_\alpha) \circ CG_\beta(t_\beta) \circ CG_\gamma(t_\gamma)) ={} & 3 \cdot 2\ell_2 \\
 & + WRNA([CNG(t_\alpha) \circ p(CLG(t_\gamma))^{R}]_{\alpha\gamma}) \\
 & + WRNA([p(CLG(t_\beta))^{R} \circ CNG(t_\gamma)]_{\beta\gamma}) \\
 & + WRNA([p(CLG(t_\alpha))^{R} \circ CNG(t_\beta)]_{\alpha\beta}).
\end{align*}

By Claim 4, the last three summands are all equal to \(E_2\) if our three k-cliques are pairwise 2k-cliques,
and otherwise at least one of the summands is less than \(E_2\). The claim follows by noticing
that \(t_\alpha \cup t_\beta \cup t_\gamma\) is a 3k-clique iff the three k-cliques are pairwise 2k-cliques. □

We are now ready to prove our main claim about \(S_G\). This proof shows that our “selection”
gadgets achieve the desired property of having exactly one clique from each type fold in an optimal
matching. Let \(N = O(n^k)\) be the size of \(C_k\), which is the number of k-cliques in our graph and
therefore the number of clique gadgets we will have from each type. We will set \(E_C = 6N\ell_3 + E_3\).

Claim 6. The weighted RNA score of \(S_G\) is \(E_C\) if G contains a 3k-clique and at most \(E_C - 1\)
otherwise.

Proof. For \(x \in \{\alpha, \beta, \gamma\}\), let \(t_x \ge 0\) denote the number of \(x'\) symbols in \(S_G\) that are not
paired. Because any clique gadget \(CG_x\) can only have matches with letters from clique gadgets
\(CG_y\) for some \(y \in \{\alpha, \beta, \gamma\}\) such that \(y \ne x\), we can say that at most \(t_x/2 + 1\) clique gadget
sequences \(CG_x\) can have letters that participate in the folding.

Recall that by definition of our weights, the total weight of any clique gadget is much less than
\(\ell_3/10\), where \(\ell_3\) is the weight of a letter \(\alpha, \beta, \gamma\), and recall the definition of \(N = |C_k|\). We will use
the inequalities:

\[ WRNA(S_G) \le \bigl((t_\alpha/2 + 1) + (t_\beta/2 + 1) + (t_\gamma/2 + 1)\bigr) \cdot \ell_3/10 + \bigl((2N - t_\alpha) + (2N - t_\beta) + (2N - t_\gamma)\bigr) \ell_3, \]

and:

\[ WRNA(S_G) \ge \bigl((2N - t_\alpha) + (2N - t_\beta) + (2N - t_\gamma)\bigr) \ell_3. \]

Since \(\ell_3 \gg \ell_3/10\), we must have that \(t_\alpha = t_\beta = t_\gamma = 0\) in any optimal folding of \(S_G\). Now we get
that:

\[ WRNA(S_G) = 6N\ell_3 + WRNA(CG_\alpha(t_\alpha) \circ CG_\beta(t_\beta) \circ CG_\gamma(t_\gamma)) \]

for some k-cliques \(t_\alpha, t_\beta, t_\gamma \in C_k\). By Claim 5 the last summand can be equal to \(E_3\) iff the graph
has a 3k-clique, and must be at most \(E_3 - 1\) otherwise. This, and the fact that \(E_C = 6N\ell_3 + E_3\),
completes the proof. □
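The step that rules out unpaired \(\alpha', \beta', \gamma'\) symbols can be spelled out with a short calculation. This is a sketch under two assumptions of ours: we write \(T\) for the total number of unpaired primed padding-type symbols (a symbol introduced here only for illustration), and we use that pairing every \(\alpha', \beta', \gamma'\) with the surrounding padding already gives a folding of weight \(6N\ell_3\).

```latex
\begin{align*}
\mathrm{WRNA}(S_G)
  &\le \Bigl(\tfrac{T}{2} + 3\Bigr)\frac{\ell_3}{10} + (6N - T)\,\ell_3 \\
  &= 6N\ell_3 - \Bigl(\tfrac{19}{20}\,T - \tfrac{3}{10}\Bigr)\ell_3 .
\end{align*}
```

For \(T \ge 1\) the bracket is at least \(13/20 > 0\), so any folding that leaves one of these symbols unpaired scores strictly less than \(6N\ell_3\) and cannot be optimal.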

We are now ready to show that the construction of \(S_G\) from graph G proves Theorem 2.

Reminder of Theorem 2. If RNA Folding on a sequence of length n can be solved in \(T(n)\) time,
then k-Clique on n node graphs can be solved in \(O(T(n^{k/3+O(1)}))\) time, for any \(k \ge 3\). Moreover,
the reduction is combinatorial.

Proof. Given a graph G on n nodes we construct the sequence \(S_G\) as described above. The sequence
can be constructed in \(O(k^{O(1)} \cdot n^{k+1})\) time, by enumerating all subsets of k nodes, and it has
length \(O(k^{O(1)} \cdot n^{k+1})\). The largest weight we use in our construction is \(\ell_3 = O(k^{O(1)} n \log n)\) and
therefore using Lemma 2 we can reduce the computation of the weighted RNA of \(S_G\) to an instance
of (unweighted) RNA folding on a sequence of length \(O(|S_G| \, k^{O(1)} n \log n) = O(k^{O(1)} n^{k+2})\). Thus,
an RNA folder as in the statement returns the weighted RNA folding score of \(S_G\) in \(T(n^{k+2})\) time
(treating k as a constant) and by Claim 6 this score determines whether G contains a 3k-clique.
All the steps in our reduction are combinatorial. □

4 Clique to Dyck Edit Distance 

In this section we prove Theorem 3 by reducing k-Clique to the Dyck Edit Distance problem,
defined below.

The Dyck grammar is defined over a fixed size alphabet of opening brackets \(\Sigma\) and of closing
brackets \(\Sigma' = \{\sigma' \mid \sigma \in \Sigma\}\), such that \(\sigma\) can only be closed by \(\sigma'\). A string S belongs to the Dyck
grammar if the brackets in it are well-formed. More formally, the Dyck grammar is defined by the
rules \(S \to SS\) and \(S \to \sigma \, S \, \sigma'\) for all \(\sigma \in \Sigma\) and \(S \to \varepsilon\). This grammar defines the Dyck context
free language (which can be parsed in linear time).
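The linear-time membership test mentioned above is the familiar stack scan. A minimal sketch follows, assuming brackets are single characters and that a dictionary `closes` (our name) maps each opening bracket to its closing bracket.

```python
def is_dyck(s, closes):
    """Linear-time membership test for the Dyck language.

    closes maps every opening bracket to the bracket that closes it.
    """
    stack = []
    for ch in s:
        if ch in closes:                      # opening bracket: remember its closer
            stack.append(closes[ch])
        elif not stack or stack.pop() != ch:  # nothing open, or wrong bracket type
            return False
    return not stack                          # everything opened must be closed
```

Each character is pushed or popped at most once, which gives the linear running time.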

The Dyck Edit Distance problem is: given a string S over \(\Sigma \cup \Sigma'\), find the minimum edit distance
from S to a string in the Dyck CFL. In other words, find the shortest sequence of substitutions
and deletions that is needed to convert S into a string that belongs to Dyck. We will refer to this
distance as the Dyck score or cost of S.

Let us introduce alternative ways to look at the Dyck Edit Distance problem that will be useful
for our proofs. Two pairs of indices \((i_1, j_1), (i_2, j_2)\) such that \(i_1 < j_1\) and \(i_2 < j_2\) are said to “cross”
iff at least one of the following three conditions holds

• \(i_1 = i_2\), or \(i_1 = j_2\), or \(j_1 = i_2\), or \(j_1 = j_2\);

• \(i_1 < i_2 < j_1 < j_2\);

• \(i_2 < i_1 < j_2 < j_1\).

Note that by our definition, non-crossing pairs cannot share any indices. We define an alignment
A of a sequence S of length n to be a set of non-crossing pairs \((i,j)\), \(i < j\), \(i,j \in [n]\). If \((i,j)\) is in
our alignment we say that letter i and letter j are aligned. We say that an aligned pair is a match
if \(S[i] = \sigma\) for some \(\sigma \in \Sigma\) and \(S[j] = \sigma'\), i.e. an opening bracket and the corresponding closing
bracket. Otherwise, we say that the aligned pair is a mismatch. Mismatches will correspond to
substitutions in an edit distance transcript. A letter at an index i that does not appear in any
of the pairs in the alignment is said to be deleted. We define the cost of an alignment to be the
number of mismatches plus the number of deleted letters. One can verify that any alignment of
cost E corresponds to an edit distance transcript from S to a string in Dyck of cost E, and vice
versa.
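The alignment view leads to a simple cubic-time dynamic program. The sketch below computes the minimum alignment cost exactly as defined above (mismatched pairs plus deleted letters); it is an illustration of the definition, with brackets as single characters and `closes` (our name) mapping each opening bracket to its closer.

```python
def dyck_cost(s, closes):
    """Minimum alignment cost of s: mismatched pairs plus deleted letters."""
    n = len(s)
    # cost[i][j] is the optimum for the half-open window s[i:j].
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            best = 1 + cost[i + 1][j]                      # delete s[i]
            for m in range(i + 1, j):                      # align s[i] with s[m]
                pair = 0 if closes.get(s[i]) == s[m] else 1  # match or mismatch
                best = min(best, pair + cost[i + 1][m] + cost[m + 1][j])
            cost[i][j] = best
    return cost[0][n]
```

A string has cost 0 under this program precisely when it is well-formed, since a cost-0 alignment pairs every letter with a correctly typed partner without crossings.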

4.1 The Reduction 

Given a graph \(G = (V, E)\) on n nodes and \(O(n^2)\) unweighted undirected edges, we will describe
how to efficiently construct a sequence \(S_G\) over an alphabet \(\Sigma\) of constant size, such that the Dyck
score of \(S_G\) will depend on whether G contains a 3k-clique. The length of \(S_G\) will be \(O(k^d n^{k+c})\)
for some small fixed constants \(c, d > 0\) independent of n and k, and the time to construct it from
G will be linear in its length. This will prove that a fast (e.g. subcubic) algorithm for Dyck Edit
Distance can be used as a fast 3k-clique detector (one that runs much faster than in \(O(n^{3k})\) time).

As in the other sections, our main strategy will be to enumerate all k-cliques in the graph and
then search for a triple of k-cliques that have all the edges between them. We will be able to find
such a triple iff the graph contains a 3k-clique. A Dyck Edit Distance algorithm will be utilized
to speed up the search for such a “good” triple. Our reduction will encode every k-clique of G
using a “short” sequence of length \(O(n^c)\) such that the Dyck score of a sequence composed of the
encodings of a triple of k-cliques will be small iff the triple is “good”. Then, we will show how to
combine the short encodings into our long sequence \(S_G\) such that the existence of a “good” triple
affects the overall score of an optimal alignment.

The Sequence. Our sequence \(S_G\) will be composed of many smaller gadgets which will be combined
in certain ways by other padding gadgets. We construct these gadgets now and explain their
useful properties. The proofs of these properties are postponed until after we present the whole
construction of \(S_G\).

Recall that we associate every node in V(G) with an integer in [n] and let \(\bar{v}\) denote the encoding
of v in binary and we will assume that it has length exactly \(2 \log n\) for all nodes. We will use the
fact that there is no node with encoding 0. For a sequence \(s \in \Sigma^{*}\) let \(p(s) \in (\Sigma')^{*}\) be the sequence
obtained from s by replacing every letter \(\sigma \in \Sigma\) with the closing bracket \(\sigma' \in \Sigma'\). That is, if
\(s = s_1 \cdots s_n\) then \(p(s) = s'_1 \cdots s'_n\).

Our alphabet \(\Sigma\) will contain the letters 0, 1 and some additional symbols which we will add as
needed in our gadgets like \(\$\), \(\#\). We will use the numbers \(\ell_0, \ldots, \ell_5\) such that \(\ell_i = (1000 \cdot n^2)^{i+1}\),
which can be bounded by \(n^{O(1)}\). We define node gadgets as,

\[ NG(v) = \$^{\ell_1} \; \bar{v} \; \$^{\ell_1} \]


and list or neighborhood gadgets as, 

LG(v) = O u $ £ °) o 0^iV(D(^° 0 $ £ °). 

These gadgets are constructed so that for any two nodes \(u, v \in V(G)\), the Dyck score of the
sequence \(NG(v) \circ p(LG(u))\) is small (equal to some fixed value \(E_1\)) if v is in the neighborhood of
u, that is \((u,v) \in E(G)\), and larger otherwise (at least \(E_1 + 1\)). This is proved formally in Claim 7
by similar arguments as in Section 3.

Note that \(\ell_2\) is an upper bound on the total length of all the symbols in the gadgets \(NG(v)\)
and \(LG(v)\), for any node \(v \in V(G)\). Let \(C_k\) be the set of k-cliques in G and consider some
\(t = \{v_1, \ldots, v_k\} \in C_k\). We will now combine the node and list gadgets into larger gadgets that will
be encoding k-cliques. We will encode a clique in two ways. The first one is,

CNG(t ) = O v& (# 4 NG(v ) # 4 ) fc 

and the second one is, 

CLG(t) = (O v& (# 4 LG{v) # 4 )) k . 

Note that \(\ell_3\) is an upper bound on the total length of all the symbols in the \(CNG(t)\) and \(CLG(t)\)
gadgets. We will add the symbol \(g\) to the alphabet. Moreover, we will now duplicate our alphabet
three times to force only “meaningful” alignments between our gadgets. It will be convenient to
think of \(\alpha, \beta, \gamma\) as three types such that we will be looking for three k-cliques, one from type \(\alpha\), one
from \(\beta\) and one from \(\gamma\). For any pair of types \(xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}\) we will construct a new alphabet
\(\Sigma_{xy} = \{\sigma_{xy} \mid \sigma \in \Sigma\}\) in which we mark each letter with the pair of types it should be participating
in. For a sequence \(s \in (\Sigma \cup \Sigma')^{*}\) we use the notation \([s]_{xy}\) to represent the sequence in \((\Sigma_{xy} \cup \Sigma'_{xy})^{*}\)
in which we replace every letter \(\sigma\) with the letter \(\sigma_{xy}\).

We will need three types of these clique gadgets in order to force the desired interaction between 
them. 

CG a (t) = a 4 (x(J 4 [g 4 CNG(t) g 4 ]« 7 o [g 4 p(CNG(t)) R g 4 ] a/3 y 4 (a ') 4 

CGpit) = b 4 (x ^) 4 [(g ') 4 CLG(t) (g 'Y 3 U 0 fe £3 P(CNG(t)) R g % 7 y 4 (b') 4 

CG y (t) = c 4 (x ;) 4 [(g ') 4 CLG{t ) (g ') 4 } Pl o [(g ') 4 p(CLG(t)) R (g ') 4 ] a7 y 4 (c ') 4 

These clique gadgets achieve exactly what we want: for any three k-cliques \(t_\alpha, t_\beta, t_\gamma \in C_k\)
the Dyck score of the sequence \(CG_\alpha(t_\alpha) \circ CG_\beta(t_\beta) \circ CG_\gamma(t_\gamma)\) is small (equal to some value \(E_3\))
if \(t_\alpha \cup t_\beta \cup t_\gamma\) is a 3k-clique and larger otherwise (at least \(E_3 + 1\)). This is formally proved in
Claim 9, again by similar arguments as in Section 3 (but more complicated because of the possible
mismatches).

The main difference over the proof of Section 3 is the way we implement the “selection” gadgets.
We want to combine all the clique gadgets into one sequence in a way that the existence of a “good”
triple, one that makes a 3k-clique, affects the Dyck score of the entire sequence. The ideas we used in
the RNA proof do not immediately work here because of “beneficial mismatches” of the separators
we add with themselves and because in Dyck \((a, a')\) match but \((a', a)\) do not (while in RNA we do
not care about the order). We will use some new ideas.

Our “clique detecting” sequence is defined as follows. 


S G = x a 4 (Q eC k CG a (t)) y'J 5 

o x/s ( OteC k CGy(t )) y '/ 5 

o x 7 4 (O te c fc CG 7 (f)) y ' 7 4 

As the x Q , y^ symbols are very rare and “expensive” an optimal alignment will match them to 
some of their counterparts within the a part of the sequence. However, when the x' a , y Q letters are 
matched, we cannot match the adjacent a, a' symbols - which are also quite expensive. Therefore, 
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the optimal behavior is to match the x Q , from exactly one “interval” from the a part. A similar 
argument holds for the /3 ,7 parts. This behavior leaves exactly one clique gadget from each type 
to be aligned freely with each other as a triple. By the construction of these gadgets, an optimal 
score can be achieved iff there is a 3/c-clique. 

This proves our main claim that the Dyck score of our clique detecting sequence Sq is small 
(equal to some fixed value Eq) if the graph contains a 3/c-clique and larger (at least Eq + 1) 
otherwise. See Claim [TO] for the formal proof. 

When k is fixed, Sq can be constructed from G in O(n k+0 ^) time, by enumerating all subsets 
of k nodes and that it has length 0(n k+0 ^). This proves Theorem [3] 

Our final alphabet $\Sigma$ has size 24 (together with $\Sigma'$ this makes 48 symbols).

$$\Sigma = \{a, b, c\} \cup \bigcup_{xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}} \{0, 1, \$, \#, g, x, y\}_{xy}$$

Formal Proofs. Let $E_1 = \log n \cdot (n - 1) + (2\ell_1 - n \cdot 2\ell_0)/2$.

Claim 7. For any $xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}$, if $v \in N(u)$, then

$$\mathrm{Dyck}([NG(v) \circ p(LG(u))]_{xy}) = E_1$$

and $> E_1$ otherwise.

Proof. We will omit the subscripts $xy$ since they do not matter for the proof. We want to claim that
the binary sequence $v$ is aligned to at most one binary sequence $p(z^R)$. If this is not so, then there
are $2\ell_0$ symbols $\$'$ from $p(LG(u))$ that will be mismatched or deleted, thus contributing at least
$S_1 = \ell_0$ to the Dyck score. There will be at least $2\ell_1 - (n-1) \cdot 2\ell_0$ symbols $\$$ that are not matched to
their counterparts. Those will contribute at least $S_2 = \ell_1 - (n-1)\ell_0$ to the Dyck score. We get
that the Dyck score is at least $S_1 + S_2 > E_1$. Now we assume that the binary sequence $v$ is aligned
to exactly one other binary sequence, which we denote by $p(z_1^R)$. There are $S_2 = 2\log(n) \cdot (n-1)$
symbols from binary sequences from $p(LG(u))$ that are not matched to their counterparts. Also,
there are $S_3 = 2\ell_1 - n \cdot 2\ell_0$ symbols $\$$ that are not matched to their counterparts. The contribution
of the unmatched symbols is $\geq (S_2 + S_3)/2 = E_1$ to the Dyck cost. We want to show that
equality can be achieved iff $v \in N(u)$. From the proof it is clear that, if $v \in N(u)$, then we can
achieve equality by choosing $z_1$ to be the element from $N(u)$ that is equal to $v$. Also, if we achieve
equality, the symbols corresponding to $S_2$ and $S_3$ contribute $\geq (S_2 + S_3)/2$ to the Dyck score.
These symbols contribute $(S_2 + S_3)/2$ to the Dyck cost iff the mismatches happen only between
themselves and all symbols $\$'$ are matched to their counterparts. The only remaining symbols that
could potentially contribute to the score correspond to $v$ and $z_1$. They contribute 0 to the Dyck
score iff $v = z_1^R$, that is, $v \in N(u)$. □

For the next proofs we will use the following definition. 

Definition 3. Given two sequences $P$ and $T$, we define

$$\mathrm{pattern}(P, T) := \min_{Q \text{ is a contiguous subsequence of } T} \mathrm{Dyck}(P \circ Q).$$
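
To make the definition concrete, here is a minimal brute-force sketch. It assumes the alignment cost model that the proofs in this section appear to use: aligned pairs may nest but not cross, a pair $(\sigma, \sigma')$ in that order costs 0, any other aligned pair costs 1, and an unaligned (deleted) symbol costs 1. The representation of symbols as strings with a trailing apostrophe for the primed copy, the function names, and the choice to allow an empty $Q$ are all assumptions made for illustration.

```python
from functools import lru_cache


def is_match(left, right):
    """True iff `left` is an unprimed symbol and `right` is its primed copy."""
    return (not left.endswith("'")) and right == left + "'"


def dyck_score(seq):
    """Minimum alignment cost of `seq` under the assumed cost model."""
    s = tuple(seq)

    @lru_cache(maxsize=None)
    def cost(i, j):
        # Cost of the half-open window s[i:j].
        if i >= j:
            return 0
        # Option 1: delete s[i].
        best = 1 + cost(i + 1, j)
        # Option 2: align s[i] with a later s[m]; the inside and the
        # remainder are solved independently, so pairs never cross.
        for m in range(i + 1, j):
            pair = 0 if is_match(s[i], s[m]) else 1
            best = min(best, pair + cost(i + 1, m) + cost(m + 1, j))
        return best

    return cost(0, len(s))


def pattern(P, T):
    """Minimum of dyck_score(P o Q) over contiguous Q in T (empty Q allowed)."""
    P, T = list(P), list(T)
    return min(
        dyck_score(P + T[i:j])
        for i in range(len(T) + 1)
        for j in range(i, len(T) + 1)
    )
```

For example, `dyck_score(["a", "a'"])` is 0 while `dyck_score(["a'", "a"])` is 1, reflecting that order matters in Dyck, and `pattern(["a", "b"], ["c", "b'", "a'", "c"])` is 0 because $Q = b'a'$ completes $P$ exactly. The recursion takes cubic time per call, so this is only suitable for small sanity checks.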


Let $E_2 = k^2 \cdot E_1$.

Claim 8. For any $xy \in \{\alpha\beta, \alpha\gamma, \beta\gamma\}$ and two $k$-cliques $t_1, t_2 \in C_k$, if $t_1 \cup t_2$ is a $2k$-clique then

$$\mathrm{Dyck}([CNG(t_1) \circ CLG(t_2)]_{xy}) = E_2$$

and $> E_2$ otherwise.

Proof. We will omit the subscripts $xy$ since they do not matter for the proof. We have that

$$\mathrm{Dyck}(CNG(t_1) \circ CLG(t_2)) \geq k \cdot \sum_{v \in t_1} \mathrm{pattern}(NG(v), CLG(t_2)).$$

Suppose that for some $v \in t_1$, $NG(v)$ is aligned to more than one gadget $p(LG(u))$. Then symbols
$\#'$ between these gadgets $p(LG(u))$ will be substituted or deleted. The cost of these operations is
$\geq \ell_2 > E_2$. Therefore, we have that any one of the $k^2$ gadgets $NG(v)$ is aligned to at most one gadget
$p(LG(u))$ for some $u \in t_2$. By the construction of $CNG$ and $CLG$, we have that

$$\mathrm{Dyck}(CNG(t_1) \circ CLG(t_2))
\geq \sum_{v \in t_1} \sum_{u \in t_2} \mathrm{pattern}\big(NG(v), (\#')^{2\ell_2}\, p(LG(u))\, (\#')^{2\ell_2}\big)
= \sum_{v \in t_1} \sum_{u \in t_2} \mathrm{Dyck}(NG(v) \circ p(LG(u))),$$

where the last equality follows because $\#$ does not appear among the symbols of $NG(v)$. Now we have
that

$$\mathrm{Dyck}(CNG(t_1) \circ CLG(t_2)) \geq \sum_{v \in t_1} \sum_{u \in t_2} E_1 = k^2 \cdot E_1 = E_2,$$

where we use Claim 7. If we have equality, it means that we have equality in all $k^2$ invocations of
Claim 7, which implies that $v \in N(u)$ for all $v \in t_1$, $u \in t_2$. This gives that there is a biclique
between the vertices of $t_1$ and $t_2$. Also, it is possible to verify that we can achieve equality if there
is a biclique. □
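
The graph-side condition that the proof reduces to can be checked directly. The sketch below is illustrative only: it assumes an adjacency-dictionary representation without self-loops, and the helper names are hypothetical. It shows that, for two $k$-cliques, "biclique between $t_1$ and $t_2$" and "$t_1 \cup t_2$ is a $2k$-clique" coincide, including the case where the two cliques overlap.

```python
from itertools import combinations


def is_clique(adj, nodes):
    """True iff every pair of distinct nodes in `nodes` is adjacent."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def is_biclique(adj, t1, t2):
    """True iff v is in N(u) for every v in t1 and every u in t2."""
    return all(v in adj[u] for v in t1 for u in t2)


def union_is_2k_clique(adj, t1, t2):
    """True iff t1 and t2 are disjoint and their union is a clique."""
    union = set(t1) | set(t2)
    return len(union) == len(t1) + len(t2) and is_clique(adj, union)


# Complete graph on 4 nodes, and the same graph with edge 1-3 removed.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
k4_minus = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
```

Because the graph has no self-loops, a shared vertex makes `is_biclique` fail, which matches the fact that overlapping $k$-cliques cannot form a $2k$-clique.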

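Let $E_3 = 3(\ell_4 + E_2)$.

Claim 9. For any triple of $k$-cliques $t_\alpha, t_\beta, t_\gamma \in C_k$, if the union $t_\alpha \cup t_\beta \cup t_\gamma$ is a $3k$-clique, then

$$\mathrm{Dyck}\big( x_\alpha^{\ell_5}\, CG_\alpha(t_\alpha)\, (y'_\alpha)^{\ell_5}
\circ x_\beta^{\ell_5}\, CG_\beta(t_\beta)\, (y'_\beta)^{\ell_5}
\circ x_\gamma^{\ell_5}\, CG_\gamma(t_\gamma)\, (y'_\gamma)^{\ell_5} \big) = E_3$$

and $> E_3$ otherwise.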

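Proof. We need to lower bound

$$\mathrm{Dyck}\big(\;
x_\alpha^{\ell_5} a^{\ell_4} (x'_\alpha)^{\ell_5}\; g_{\alpha\gamma}^{\ell_3}\, CNG_{\alpha\gamma}(t_\alpha)\, g_{\alpha\gamma}^{\ell_3}\, g_{\alpha\beta}^{\ell_3}\, CNG_{\alpha\beta}(t_\alpha)\, g_{\alpha\beta}^{\ell_3}\; y_\alpha^{\ell_5} (a')^{\ell_4} (y'_\alpha)^{\ell_5}$$
$$\circ\; x_\beta^{\ell_5} b^{\ell_4} (x'_\beta)^{\ell_5}\; (g'_{\alpha\beta})^{\ell_3}\, CLG_{\alpha\beta}(t_\beta)\, (g'_{\alpha\beta})^{\ell_3}\, g_{\beta\gamma}^{\ell_3}\, CNG_{\beta\gamma}(t_\beta)\, g_{\beta\gamma}^{\ell_3}\; y_\beta^{\ell_5} (b')^{\ell_4} (y'_\beta)^{\ell_5}$$
$$\circ\; x_\gamma^{\ell_5} c^{\ell_4} (x'_\gamma)^{\ell_5}\; (g'_{\beta\gamma})^{\ell_3}\, CLG_{\beta\gamma}(t_\gamma)\, (g'_{\beta\gamma})^{\ell_3}\, (g'_{\alpha\gamma})^{\ell_3}\, CLG_{\alpha\gamma}(t_\gamma)\, (g'_{\alpha\gamma})^{\ell_3}\; y_\gamma^{\ell_5} (c')^{\ell_4} (y'_\gamma)^{\ell_5} \;\big).$$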
Assume that some symbol $a$ is aligned to a symbol to the right of $x'_\alpha$. Then the sequence $(x'_\alpha)^{\ell_5}$
will contribute $\geq \ell_5/2 > E_3$ to the Dyck score and we are done. (We prove later that we can
achieve Dyck score $E_3$ if there is a clique.) Now let $r$ denote the number of symbols from the sequence
$j = x_\alpha^{\ell_5} a^{\ell_4} (x'_\alpha)^{\ell_5}$ that are aligned to a symbol that does not belong to sequence $j$. Let $s$ denote
the set of these $r$ symbols. Let $l$ denote the number of symbols $x_\alpha$ that are aligned to a symbol $x'_\alpha$. There are
$2\ell_5 + \ell_4 - r - 2l$ symbols from $j$ that are not considered yet. These symbols are not matched
to their counterparts and, therefore, contribute at least $\lceil (2\ell_5 + \ell_4 - r - 2l)/2 \rceil$ to the Dyck score
(we divide by 2 because the symbols can be mismatched among themselves in pairs). We have
that $\lceil (2\ell_5 + \ell_4 - r - 2l)/2 \rceil + r \geq \ell_4/2 + \lceil r/2 \rceil$ by the definition of $l$ (it implies that $l \leq \ell_5$).
We note that $\ell_4/2 = \mathrm{Dyck}(j)$. $\mathrm{Dyck}(j) \leq \ell_4/2$ can be obtained by aligning symbols $x_\alpha$ with
symbols $x'_\alpha$ and mismatching symbols $a$ in pairs. The reverse inequality follows by observing that
all symbols $a$ will be mismatched. Also we note that, if we mismatch symbols from $s$ among
themselves, this costs $\lceil r/2 \rceil$. All this gives that we can assume that symbols in $j$ do not interact
with symbols that are not in $j$ when we want to bound the Dyck score. Similarly, we can argue
when $j = y_\alpha^{\ell_5} (a')^{\ell_4} (y'_\alpha)^{\ell_5}$, $x_\beta^{\ell_5} b^{\ell_4} (x'_\beta)^{\ell_5}$, $y_\beta^{\ell_5} (b')^{\ell_4} (y'_\beta)^{\ell_5}$, $x_\gamma^{\ell_5} c^{\ell_4} (x'_\gamma)^{\ell_5}$, $y_\gamma^{\ell_5} (c')^{\ell_4} (y'_\gamma)^{\ell_5}$. Thus, we
need to show that


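$$\mathrm{Dyck}\big(\;
g_{\alpha\gamma}^{\ell_3}\, CNG_{\alpha\gamma}(t_\alpha)\, g_{\alpha\gamma}^{\ell_3}\;
g_{\alpha\beta}^{\ell_3}\, CNG_{\alpha\beta}(t_\alpha)\, g_{\alpha\beta}^{\ell_3}\;
(g'_{\alpha\beta})^{\ell_3}\, CLG_{\alpha\beta}(t_\beta)\, (g'_{\alpha\beta})^{\ell_3}$$
$$g_{\beta\gamma}^{\ell_3}\, CNG_{\beta\gamma}(t_\beta)\, g_{\beta\gamma}^{\ell_3}\;
(g'_{\beta\gamma})^{\ell_3}\, CLG_{\beta\gamma}(t_\gamma)\, (g'_{\beta\gamma})^{\ell_3}\;
(g'_{\alpha\gamma})^{\ell_3}\, CLG_{\alpha\gamma}(t_\gamma)\, (g'_{\alpha\gamma})^{\ell_3}
\;\big) \geq 3E_2$$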


with equality iff there is a biclique between vertices of $t_\alpha$ and $t_\beta$, between vertices of $t_\alpha$ and $t_\gamma$ and
between vertices of $t_\beta$ and $t_\gamma$. Let $h$ be the argument to the Dyck function, that is, we want to show
that $\mathrm{Dyck}(h) \geq 3E_2$ with the stated condition for equality.

Consider three gadgets $CNG$ and three gadgets $CLG$ as above. We can assume that no symbol
of any of these six gadgets is aligned to any symbol $g_{xy}$ or $g'_{xy}$. Assume that this is not the case. Then
we can delete all symbols from the gadgets that are aligned to symbols $g_{xy}$ or $g'_{xy}$. After this, we
rematch $g_{xy}$ or $g'_{xy}$ among themselves. We can check that we can always make this rematching of
symbols $g_{xy}$ or $g'_{xy}$ so that the cost does not increase. Furthermore, if some $CNG_{xy}$ gadget is aligned
with a $CNG_{x'y'}$ (or $CLG_{x'y'}$) for $(x, y) \neq (x', y')$, then there are two substrings of the type $g_{ab}$ or
$g'_{ab}$ that don't have their counterpart between $CNG_{xy}$ and $CNG_{x'y'}$ (or $CLG_{x'y'}$). Hence, their
contribution to the Dyck score is at least $2\ell_3/2 = \ell_3 > 3E_2$. Thus for all $(x, y)$ the only gadget
that $CNG_{xy}$ can be aligned with is $CLG_{xy}$ and vice versa. This means that we can assume that
all the $g$ and $g'$ symbols are completely aligned.

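We have shown that the Dyck cost of the string is exactly

$$\mathrm{Dyck}(CNG_{\alpha\gamma}(t_\alpha) \circ CLG_{\alpha\gamma}(t_\gamma))
+ \mathrm{Dyck}(CNG_{\alpha\beta}(t_\alpha) \circ CLG_{\alpha\beta}(t_\beta))
+ \mathrm{Dyck}(CNG_{\beta\gamma}(t_\beta) \circ CLG_{\beta\gamma}(t_\gamma)).$$

We want to show that this is $\geq 3E_2$ with equality iff $t_\alpha \cup t_\beta \cup t_\gamma$ is a $3k$-clique. This was shown
in Claim 8. □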

We now turn to the proof of the claim about the behavior of our “selection gadgets”. Let $E_C$
be a fixed integer to be defined later that depends on $n$, $N$, and $k$.

Claim 10. If $G$ contains a $3k$-clique, then $\mathrm{Dyck}(S_G) = E_C$, and $\mathrm{Dyck}(S_G) > E_C$ otherwise.

The proof of this claim will require several claims and lemmas. We start with some lemmas 
about general properties of Dyck Edit Distance. 

Lemma 3. Let $Z_1$ be a substring of sequence $Z$. Assume that $Z_1$ is of even length. If all
symbols from $Z_1$ participate in mismatches and deletions only, then we can modify the alignment
so that there is no symbol in $Z_1$ that is aligned to a symbol that is not in $Z_1$.

Proof. Let $l$ denote the number of symbols from $Z_1$ that are aligned to symbols that are outside of $Z_1$.
Let $s$ denote the set of all these symbols outside of $Z_1$ that are aligned to $Z_1$. There are two cases
to consider.

• $l$ is even. Then the Dyck cost induced by symbols in $s$ and in $Z_1$ is at least $S_1 := l + (|Z_1| - l)/2$
by the properties from the statement of the lemma. We do the following modification to the
alignment. We align all symbols in $s$ among themselves in pairs. This induces cost $S_2 := l/2$.
We align all symbols in $Z_1$ among themselves. This induces cost $S_3 = |Z_1|/2$. The total
induced cost after the modification is $S_2 + S_3 \leq S_1$ and we satisfy the requirement in the
lemma.

• $l$ is odd. Then the Dyck cost induced by symbols in $s$ and in $Z_1$ is at least $S_1 := l + (|Z_1| - l + 1)/2$.
We do the following modification to the alignment. We align all symbols in $s$
among themselves in pairs except one symbol, which we delete (remember that $l$ is odd). This
induces cost $S_2 := (l + 1)/2$. We align all symbols in $Z_1$ among themselves. This induces cost
$S_3 = |Z_1|/2$. The total induced cost after the modification is $S_2 + S_3 \leq S_1$ and we satisfy the
requirement in the lemma.

□ 
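
As a quick check of the two cases in the proof of Lemma 3, the following worked computation takes $S_1 = l + (|Z_1| - l)/2$, $S_2 = l/2$ for even $l$ and $S_1 = l + (|Z_1| - l + 1)/2$, $S_2 = (l + 1)/2$ for odd $l$, with $S_3 = |Z_1|/2$ in both cases. Under that reading the modified alignment costs exactly the lower bound on the original one, so the modification never increases the cost.

```latex
\begin{align*}
\text{($l$ even)} \quad
S_1 - (S_2 + S_3)
  &= \Big(l + \tfrac{|Z_1| - l}{2}\Big) - \Big(\tfrac{l}{2} + \tfrac{|Z_1|}{2}\Big)
   = 0, \\
\text{($l$ odd)} \quad
S_1 - (S_2 + S_3)
  &= \Big(l + \tfrac{|Z_1| - l + 1}{2}\Big) - \Big(\tfrac{l + 1}{2} + \tfrac{|Z_1|}{2}\Big)
   = 0.
\end{align*}
```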


Consider an optimal alignment of some string $w \in (\Sigma \cup \Sigma')^*$.

Lemma 4. Let $Z$ be a maximal substring of $w$ consisting entirely of symbols $z$ (for some symbol
$z$ appearing in $w$). Let $Z_0$ be a maximal substring of $w$ consisting entirely of symbols $z_0$. Let
$|Z| = |Z_0|$ and let there be matches between $Z$ and $Z_0$. Also, assume there are no maximal substrings
of $w$ containing $z$ other than $Z$. Then we can increase the Dyck score by at most 2 by modifying
the alignment and get that the symbols of $Z_0$ that are matched to $z$ form a substring of $Z_0$ and
the substring is a suffix or prefix of $Z_0$ (we can choose whether it is a suffix or a prefix). We can also
assume that the rest of the symbols in $Z_0$ are deleted or mismatched among themselves.

Proof. Wlog, we will show that we can make the substring a suffix. We write $Z_0 = Z_1 Z_2 Z_3$
so that the first symbol of $Z_2$ is the first symbol of $Z_0$ that is aligned to some symbol $z$ and the
last symbol of $Z_2$ is the last symbol of $Z_0$ that is aligned to some symbol $z$. First, we modify the
alignment as follows. If the length of $Z_1$ is even, we don't do anything. Otherwise, consider the first
symbol of $Z_1$. If it is aligned to some symbol, delete the first symbol of $Z_1$ and the symbol aligned
to it, which increases the Dyck cost by 1. Similarly, delete the last symbol of $Z_3$ and the symbol
aligned to it if $Z_3$ is of odd length. Now, by Lemma 3 we can assume that all symbols in $Z_1$ and $Z_3$
are mismatched among themselves (except, possibly, the first symbol of $Z_1$ and the last symbol of
$Z_3$). Now we are in the state where all symbols from $Z_0$ are mismatched among themselves except
a few that are matched with symbols $z$ from $Z$. Now we can rematch these symbols from $Z$ with
symbols $z_0$ from $Z_0$ so that the $z_0$ come from a suffix of $Z_0$. We see that this does not increase the Dyck
score besides two possible deletions. □
Now we prove some claims about the properties of the optimal alignment of $S_G$. These claims
essentially show that any “bad behaviour”, in which we do not align exactly one clique gadget of
each type, is suboptimal.

Let $E_1'$ be some value that depends only on $n$ and $k$.

Claim 11. For any gadget $CG_\beta(t)$ from the sequence, if none of the $x'_\beta, y_\beta$ symbols are matched
to their counterparts, then all the $b$ symbols from $CG_\beta(t)$ must be matched to their counterparts $b'$
from $CG_\beta(t)$, and in this case, the gadget contributes $E_1'$ to the Dyck score. Analogous claims hold
for $\alpha, \gamma$.

Proof. By Lemma 3 we can assume that all symbols $x'_\beta$ are mismatched among themselves. The
same can be said about symbols $y_\beta$. Let $Z$ be the substring of the gadget between symbols $x'_\beta$
and $y_\beta$. If $Z$ has matches with some symbols, then, by the construction of $w$, no symbol $b$ or
$b'$ is matched to its counterpart and by Lemma 3, we get that $b$ and $b'$ are mismatched among
themselves. But now we can decrease the Dyck score by deleting all symbols from $Z$ and all symbols
that $Z$ is aligned to outside the gadget. This increases the Dyck score but then we can decrease
it substantially by matching all symbols $b$ to their counterparts $b'$ from the gadget. In the end we
get a smaller Dyck score because $\ell_4 > 100|Z|$. Now it remains to consider the case when symbols
in $Z$ do not participate in matches. But then by Lemma 3 we again conclude that symbols in $Z$
participate only in mismatches and only among themselves. Let $s$ be the union of all symbols $b$
and $b'$ of the gadget and all symbols that these symbols are aligned to outside the gadget. Let $l$
denote the number of symbols from $s$ that are not coming from the gadget. Consider the following cases.

• $l = 0$. We satisfy the requirements of the claim by matching all symbols $b$ to their counterparts
$b'$.

• There is no symbol among the $l \geq 1$ symbols that participates in a match. We have that all
symbols in $s$ contribute at least $l$ to the Dyck score. We modify the alignment as follows.
We match all symbols $b$ to their counterparts $b'$. We match the remaining $l$ symbols from $s$ among
themselves. If there is an odd number of them, we delete one. This contributes at most $S_1 :=
(l + 1)/2$ to the Dyck score after the modification. We have that $S_1 \leq l$ for $l \geq 1$.

• Complement of the previous two cases: there is a symbol among the $l \geq 1$ symbols that
participates in a match. Wlog, symbols $b$ from the gadget participate in at least one match.
Then all symbols $b'$ from the gadget do not participate in any match and by Lemma 3 we
have that all symbols $b'$ from the gadget are mismatched among themselves. Therefore, we
can assume that $l \leq \ell_4$. We need to consider two subcases.

- $l$ is even. The symbols in $s$ contribute at least $S_1 = (2\ell_4 - l)/2$ to the Dyck score. We do
the following modification to the alignment. We match all symbols $b$ to their counterparts
$b'$. We mismatch the $l$ symbols in pairs among themselves. After the modification, the Dyck
contribution of symbols from $s$ is $S_2 := l/2$. We see that $S_2 \leq S_1$.

- $l$ is odd. The symbols in $s$ contribute at least $S_1 = (2\ell_4 - l + 1)/2$ to the Dyck score (at
least one symbol is deleted because $2\ell_4 - l$ is odd). We do the following modification to
the alignment. We match all symbols $b$ to their counterparts $b'$. We mismatch the $l$ symbols
in pairs among themselves except that we delete one symbol. After the modification, the
Dyck contribution of symbols from $s$ is $S_2 := (l + 1)/2$. We see that $S_2 \leq S_1$.

Because the symbols in between $b$ and $b'$ don't have their counterparts among themselves, the gadget
contributes $E_1' := (|CG_\beta(t)| - 2\ell_4)/2$ to the Dyck cost. This quantity only depends on $n$ and $k$ and
this can be verified from the construction of $S_G$. □
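
The comparison in the two subcases of the last case can be verified in one line each. Taking $S_1 = (2\ell_4 - l)/2$, $S_2 = l/2$ for even $l$ and $S_1 = (2\ell_4 - l + 1)/2$, $S_2 = (l + 1)/2$ for odd $l$, the inequality $S_2 \leq S_1$ is exactly the assumption $l \leq \ell_4$ made just before the subcases:

```latex
\begin{align*}
\text{($l$ even)} \quad
S_1 - S_2 &= \tfrac{2\ell_4 - l}{2} - \tfrac{l}{2} = \ell_4 - l \;\geq\; 0, \\
\text{($l$ odd)} \quad
S_1 - S_2 &= \tfrac{2\ell_4 - l + 1}{2} - \tfrac{l + 1}{2} = \ell_4 - l \;\geq\; 0.
\end{align*}
```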
Claim 12. In some optimal alignment, we can assume that there is some symbol $x_\beta$ that is aligned
to its counterpart $x'_\beta$. Analogous statements can be proved about symbols $y_\beta, y'_\beta, x_\alpha, x'_\alpha, y_\alpha, y'_\alpha$,
$x_\gamma, x'_\gamma, y_\gamma, y'_\gamma$.

Proof. Suppose that $x_\beta$ is not aligned to any symbol $x'_\beta$. We will modify the alignment so that
all symbols $x_\beta$ are aligned to $x'_\beta$ coming from the first substring $f$ of $S_G$ consisting entirely of
$x'_\beta$. From the statement we have that all symbols $x_\beta$ and all symbols from $f$ are mismatched or
deleted. Therefore, by Lemma 3 we get that all symbols $x_\beta$ are mismatched among themselves and
all symbols in $f$ are mismatched among themselves. Now we modify the alignment as follows to
achieve our goal. We delete all symbols $b$ between $x_\beta$ and $f$. Also, we delete all the symbols that
the deleted $b$'s were aligned to before the deletion. This increases the Dyck cost by at most $2\ell_4$. Then
we match all symbols $x_\beta$ to $x'_\beta$ in pairs. This decreases the Dyck cost by $\ell_5$. As a result we decrease
the Dyck cost because $\ell_5 > 2\ell_4$. □

Claim 13. In some optimal alignment, there is a symbol $x'_\beta$ that is matched to a symbol $x_\beta$ and
there is a symbol $y_\beta$ that is matched to a symbol $y'_\beta$ so that the symbols $x'_\beta$ and $y_\beta$ come from the
same gadget $CG_\beta(t)$. Analogous statements can be proved about symbols $x_\alpha, x'_\alpha, y_\alpha, y'_\alpha, x_\gamma, x'_\gamma$,
$y_\gamma, y'_\gamma$.

Proof. By Claim 12, $x_\beta$ is aligned to some sequence consisting of $x'_\beta$. Suppose that the sequence
comes from gadget $CG_\beta(t_1)$ for some $t_1$. Also, by Claim 12, $y'_\beta$ is aligned to some sequence consisting
of $y_\beta$. Suppose that the sequence comes from gadget $CG_\beta(t_2)$ for some $t_2$. We want to prove that
$t_1 = t_2$. Suppose that this is not the case and $CG_\beta(t_1)$ comes to the left of $CG_\beta(t_2)$ (the order
can't be reversed by the construction of $S_G$ and because the alignments can't cross). Suppose that
there is some other gadget $CG_\beta(t_3)$ in between $CG_\beta(t_1)$ and $CG_\beta(t_2)$. Then we can verify that
$CG_\beta(t_3)$ satisfies the conditions of Claim 11 and we can assume that all symbols in $CG_\beta(t_3)$ are
aligned with symbols in $CG_\beta(t_3)$. Therefore, we can remove gadget $CG_\beta(t_3)$ from $S_G$ (because it
does not interact with symbols outside it) and this decreases the Dyck cost by $E_1'$. We do that until
$CG_\beta(t_1)$ is to the left of $CG_\beta(t_2)$ and they are neighboring. Now we will change the alignment
so that $x_\beta$ is aligned to a symbol $x'_\beta$ from $CG_\beta(t_2)$ and as a result we will decrease the Dyck cost. Now
we can verify from the construction of $S_G$ that symbols $b, y_\beta, b'$ from $CG_\beta(t_1)$ do not participate
in matches. Also, symbols $b$ and $x'_\beta$ from $CG_\beta(t_2)$ do not participate in matches. By Lemma 3 we
conclude that all these symbols have mismatches among themselves. Let $g$ be the sequence between
symbols $x'_\beta$ and $y_\beta$ in $CG_\beta(t_1)$. Now we modify the alignment so that the symbols in sequence $g$
have mismatches only among themselves and the symbols that were aligned to symbols in $g$ are
deleted. This increases the Dyck cost by at most $10|g| \leq 100\ell_3 =: S_1$. Some symbols $x'_\beta$ are aligned
to symbols outside $CG_\beta(t_1)$. We transfer these alignments of symbols $x'_\beta$ from $CG_\beta(t_1)$ to symbols
$x'_\beta$ in $CG_\beta(t_2)$ so that we only have mismatches among $x'_\beta$ in $CG_\beta(t_1)$ and we don't change the
Dyck cost. Now we can align all symbols $b$ in $CG_\beta(t_1)$ to $b'$ in $CG_\beta(t_1)$ in pairs. This decreases
the Dyck cost by $\ell_4$. In the end, we decreased the Dyck cost by $\ell_4 - S_1 > 0$ and we proved what
we wanted. □

Claim 14. In some optimal alignment, the only symbols $x'_\beta$ that symbols $x_\beta$ are matched to come
from the same gadget $CG_\beta(t)$, and the only symbols $y_\beta$ that symbols $y'_\beta$ are matched to come from
the same gadget $CG_\beta(t)$. In both cases it is the same gadget $CG_\beta(t)$. Analogous statements can be
proved about symbols $x_\alpha, x'_\alpha, y_\alpha, y'_\alpha, x_\gamma, x'_\gamma, y_\gamma, y'_\gamma$.
Proof. Suppose that $x_\beta$ is matched to symbols $x'_\beta$ coming from two different gadgets $CG_\beta(t_1)$ and
$CG_\beta(t_2)$, where $CG_\beta(t_1)$ comes earlier in $S_G$ than $CG_\beta(t_2)$. Assume that there is no gadget $CG_\beta(t_3)$ in
between $CG_\beta(t_1)$ and $CG_\beta(t_2)$ in sequence $S_G$. We can make this assumption because otherwise we
can remove $CG_\beta(t_3)$ as in Claim 11. We can check that all symbols $x'_\beta$ and $y_\beta$ from $CG_\beta(t_3)$ do not
participate in matches and thus we satisfy the requirements of Claim 11. Now we can check that
symbols $b, y_\beta, b'$ in $CG_\beta(t_1)$ do not participate in matches. Also, symbols between $x'_\beta$ and $y_\beta$ in
$CG_\beta(t_1)$ do not participate in matches. Also, symbols $b'$ do not participate in matches. Therefore,
by Lemma 3 we conclude that all these symbols participate in mismatches only among themselves.
Let $X_1$ denote the sequence of $x'_\beta$ from $CG_\beta(t_1)$ and $X_2$ denote the sequence of $x'_\beta$ from $CG_\beta(t_2)$. By
Lemma 4 we can assume that the symbols $x'_\beta$ from $X_1$ and $X_2$ that are matched to $x_\beta$ form a suffix in
both sequences and the rest of the symbols in both sequences are mismatched among themselves or
deleted. The corresponding modifications increase the Dyck cost by at most $4 =: S_1$. Let $Z_1$
be the suffix of $X_1$ and $Z_2$ be the suffix of $X_2$. $|Z_1| + |Z_2|$ is less than or equal to the total number of
symbols $x_\beta$ in $S_G$ by the construction of $S_G$. Suppose that $|Z_1|$ is even. We mismatch all symbols
in $Z_1$ among themselves and match the resulting unmatched $|Z_1|$ symbols $x_\beta$ to $x'_\beta$ from $X_2$. This does
not change the Dyck cost. Suppose that $|Z_1|$ is odd; then there is a deletion among the symbols in $X_1$ that
are not in $Z_1$. We do mismatches among the symbols of $Z_1$ and the one deleted. We match the resulting
unmatched $|Z_1|$ symbols $x_\beta$ to $x'_\beta$ from $X_2$. We can check that we can do this matching so that the
Dyck cost does not increase. Now we can match all symbols $b$ from $CG_\beta(t_1)$ to their counterparts
$b'$ in $CG_\beta(t_1)$. This decreases the Dyck cost by $S_2 := \ell_4$. In total, we decrease the Dyck cost by
$\geq S_2 - S_1 > 0$. □

We note that in the proof of Claim 11 we remove $3N - 3$ cliques from the graph ($N$ is the
number of $k$-cliques in the graph), each removal costing $E_1'$. After all the removals, we arrive at a
sequence of the form required in Claim 9. Thus, we set $E_C := (3N - 3)E_1' + E_3$ and our proof of
Claim 10 is finished.

We are now ready to show that the construction of $S_G$ from the graph $G$ proves Theorem 3.

Reminder of Theorem 3. If Dyck edit distance on a sequence of length $n$ can be solved in $T(n)$
time, then $3k$-Clique on $n$ node graphs can be solved in $O(T(n^{k+O(1)}))$ time, for any $k \geq 1$.
Moreover, the reduction is combinatorial.

Proof. Given a graph $G$ on $n$ nodes we construct the sequence $S_G$ as described above. The sequence
can be constructed in $O(k^{O(1)} \cdot n^{k+O(1)})$ time, by enumerating all subsets of $k$ nodes, and it has
length $O(k^{O(1)} \cdot n^{k+O(1)})$. Thus, an algorithm for Dyck Edit Distance as in the statement returns
the Dyck score of $S_G$ in $T(n^{k+O(1)})$ time (treating $k$ as a constant) and by Claim 10 this score
determines whether $G$ contains a $3k$-clique. All the steps in our reduction are combinatorial. □
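
To see what the theorem yields numerically, consider a hypothetical truly subcubic algorithm with $T(n) = O(n^{3-\varepsilon})$, and write the hidden additive constant in the exponent $k + O(1)$ as $d$ (the symbol $d$ is introduced here only for this illustration). The resulting exponent for $3k$-Clique drops below the trivial $3k$ as soon as $k$ is large enough relative to $d/\varepsilon$:

```latex
\begin{align*}
T\big(n^{k+d}\big) &= O\big(n^{(3-\varepsilon)(k+d)}\big)
                    = O\big(n^{\,3k - \varepsilon k + (3-\varepsilon)d}\big), \\
3k - \varepsilon k + (3-\varepsilon)d &< 3k
  \iff k > \frac{(3-\varepsilon)\,d}{\varepsilon}.
\end{align*}
```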